Re: Streaming Tagger

David '-1' Schmid Fri, 28 Feb 2020 05:26:23 -0800

I just wanted to pick this up, but somehow my JIRA account got deactivated. Once I have that figured out, I'll try to propose the change. Thank you!


On 28.02.20 14:13, David Smiley wrote:

Thanks for your input, David. I won't accept the patch because I think there's a more appropriate way to go about this: have the Tagger constructor take an Analyzer instead of a TokenStream, and then have the process method take the InputStream and/or String (the fundamental input to the tagger), thus allowing repeated use of the same Tagger. It's been a long-standing FAQ: how do I tag in bulk? This change would help with that, at least at a low level, which is your need. I've filed a JIRA: SOLR-14292 - Refactor Tagger for re-use, thus aiding bulk-tagging <https://issues.apache.org/jira/browse/SOLR-14292>. I don't plan on doing this anytime soon, so feel free to take it up if you wish.
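
The refactoring proposed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Solr Tagger API: ReusableTaggerSketch, its process signature and simpleAnalyze are hypothetical names, a Function<String, List<String>> stands in for a Lucene Analyzer, and a Set<String> stands in for the indexed dictionary terms. The sketch only matches single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/**
 * Hypothetical sketch of a re-usable tagger; NOT the Solr Tagger API.
 * The Function stands in for an Analyzer, the Set for the dictionary terms.
 */
final class ReusableTaggerSketch {
    private final Function<String, List<String>> analyzer; // stand-in for Analyzer
    private final Set<String> dictionary;                  // stand-in for Terms

    // Configuration is supplied once; no input text is bound at construction.
    ReusableTaggerSketch(Function<String, List<String>> analyzer, Set<String> dictionary) {
        this.analyzer = analyzer;
        this.dictionary = dictionary;
    }

    // The input arrives per call, so one instance can tag many strings.
    List<String> process(String input) {
        List<String> tags = new ArrayList<>();
        for (String token : analyzer.apply(input)) {
            if (dictionary.contains(token)) {
                tags.add(token);
            }
        }
        return tags;
    }

    // Toy analysis chain (an assumption): lower-case, split on non-letters.
    static List<String> simpleAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

Because all per-input state lives inside process, a single instance can be built once and then called for every string that needs tagging.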


~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Fri, Feb 28, 2020 at 4:12 AM David '-1' Schmid <david.sch...@vis.uni-stuttgart.de> wrote:


    On 27.02.20 19:01, David Smiley wrote:
     > I'm glad you got it working!  It's sad you felt the need to copy-paste
     > the tagger; perhaps you can recommend changes to make it more extensible
     > so that you or others needn't fork it.

    Don't need to feel sad, just as I mentioned: it's quick, dirty and I did
    not know better.
    I was wondering how to feed multiple Strings into the tagger w/o
    creating new instances of everything, but as I don't know much about how
    the tokenizers work, I just slapped everything together.

    I had planned to maybe use an InputStream that blocks once one string
    was exhausted, so I can feed the tags back into the stream and feed the
    InputStream new data, once TupleStream::read is called again.
    But since I wanted to get this done quickly, ... yeah. That happened.
    Not happy with it, but I learned a lot.
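
A simpler alternative to the blocking InputStream described above is to tag one tuple per read() call. The sketch below assumes a re-usable tagger is available as a plain function; it is not Solr code. TaggingStreamSketch and the "tags" key are hypothetical, a Map<String, Object> stands in for a Tuple, an Iterator stands in for the wrapped stream, and returning null at the end of input is a simplification of this sketch.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical decorator that tags one tuple per read(); NOT a Solr class.
 * Map stands in for Tuple, Iterator for the inner stream, Function for the tagger.
 */
final class TaggingStreamSketch {
    private final Iterator<Map<String, Object>> inner;
    private final String field;
    private final Function<String, List<String>> tagger;

    TaggingStreamSketch(Iterator<Map<String, Object>> inner, String field,
                        Function<String, List<String>> tagger) {
        this.inner = inner;
        this.field = field;
        this.tagger = tagger;
    }

    // Pulls the next tuple, tags the configured field, and returns a copy
    // with a "tags" entry. Returns null once the inner stream is exhausted.
    Map<String, Object> read() {
        if (!inner.hasNext()) {
            return null;
        }
        Map<String, Object> tuple = new HashMap<>(inner.next());
        Object text = tuple.get(field);
        List<String> tags = (text == null) ? List.of() : tagger.apply(text.toString());
        tuple.put("tags", tags);
        return tuple;
    }
}
```

Each call does a bounded amount of work, so no thread has to block while waiting for the next string.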

    I'm not sure if I'm qualified enough to recommend changes about the
    tagger. I'd maybe change the constructor to not accept a TokenStream,
    but just the configuration (reduce strategy, terms, ...). And provide a
    setter for the TokenStream. (patch attached)
    But that implies that a TokenStream is cheap to construct and use, which
    I don't know.

     > I'm not sure if something like this should be contributed back to Solr
     > itself.  I don't even know the bigger picture of why you are doing this,
     > so I am pessimistic :-).
    Which is completely fine :D
    Thank you for the guidance!

    best regards,
    David

     >
     > ~ David Smiley
     > Apache Lucene/Solr Search Developer
     > http://www.linkedin.com/in/davidwsmiley
     >
     >
     > On Thu, Feb 27, 2020 at 8:01 AM David '-1' Schmid
     > <david.sch...@vis.uni-stuttgart.de> wrote:
     >
     >     Hello again!
     >
     >     On 25.02.20 22:39, David Smiley wrote:
     >      > I haven't worked on streaming expressions yet but I did a little bit of
     >      > digging around.  I think the ClassifyStream might be somewhat similar to
     >      > learn from.  It takes a stream of docs, not unlike what you want.  And
     >      > crucially it implements setStreamContext with an implementation which
     >      > demonstrates how to get access to a SolrCore.  From a core, you can get
     >      > a SolrIndexSearcher. [...]
     >
     >     That worked beautifully! Or let's say: I got it working, the code is not
     >     beautiful, as is.
     >     Would this be interesting/relevant enough to be adopted upstream?
     >
     >     If so, should I open up a JIRA ticket?
     >
     >     best regards,
     >     David
     >
     >
     >
     >      > On Fri, Feb 21, 2020 at 8:05 AM David '-1' Schmid
     >      > <david.sch...@vis.uni-stuttgart.de> wrote:
     >      >
     >      >     Hello dear developers!
     >      >
     >      >     I've been wondering if I'd be able to adapt the current
     >      >     TaggerRequestHandler for using it within the /stream request handler.
     >      >
     >      >     Starting out is a tad confusing, which I expected since I have almost no
     >      >     experience with the solr/lucene codebase.
     >      >
     >      >     My goal is as follows: I want to use the result of a previous
     >      >     select(coll1, ...) as input for adding tags to the result document.
     >      >
     >      >     Possibly:
     >      >     tag(
     >      >         select(...), field_to_analyze_for_tags,
     >      >         collection_with_tag_dict, tag_dict_field,
     >      >         ... // remaining tagger configuration options
     >      >     )
     >      >
     >      >     I'm currently stuck at some steps in writing a
     >      >     'public class TaggerStream extends TupleStream implements Expressible'
     >      >     at two points:
     >      >
     >      >     == Problem 1: Getting 'terms' ==
     >      >
     >      >     The TaggerRequestHandler gets a SolrIndexSearcher via the request
     >      >
     >      >       > final SolrIndexSearcher searcher = req.getSearcher();
     >      >
     >      >     Which in turn is used to acquire the terms
     >      >
     >      >       > Terms terms = searcher.getSlowAtomicReader().terms(indexedField);
     >      >
     >      >     which are used for tagging.
     >      >
     >      >     I've tried finding something that will yield the equivalent, but as you
     >      >     might have guessed: I didn't find anything so far.
     >      >
     >      >
     >      >     == Problem 2: Multiple Shards ==
     >      >
     >      >     I guess, this might come up sooner or later, hence this is related to
     >      >     SOLR-14190 (requesting the tagger to work across multiple shards).
     >      >     I suspect (mind: I really don't know) that acquiring the terms will have
     >      >     to do something with that, at least when we need to merge the results
     >      >     from multiple shards, but I have not yet found any code that does that.
     >      >     Might have been blinded by my confusion, tho.
     >      >
     >      >
     >      >     I'd be thankful if someone can help with any pointers regarding this.
     >      >
     >      >     best regards,
     >      >     David
     >      >
     >      >

     >      >     ---------------------------------------------------------------------
     >      >     To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
     >      >     For additional commands, e-mail: dev-h...@lucene.apache.org
     >      >
     >

