On 27.02.20 19:01, David Smiley wrote:
> I'm glad you got it working! It's sad you felt the need to copy-paste
> the tagger; perhaps you can recommend changes to make it more extensible
> so that you or others needn't fork it.
No need to feel sad; as I mentioned, it's quick and dirty, and I didn't
know better.
I was wondering how to feed multiple Strings into the tagger without
creating new instances of everything, but since I don't know much about
how the tokenizers work, I just slapped everything together.
I had planned to maybe use an InputStream that blocks once one string is
exhausted, so I could feed the tags back into the stream and hand the
InputStream new data once TupleStream::read is called again.
But since I wanted to get this done quickly... yeah. That happened.
I'm not happy with it, but I learned a lot.
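Just to sketch what I had in mind (names invented, never tried; and note
that Lucene tokenizers actually consume a java.io.Reader rather than an
InputStream, so it would have been a blocking Reader):

    import java.io.IOException;
    import java.io.Reader;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    /** A Reader that blocks once the current string is exhausted,
     *  until more data is fed in or end-of-input is signalled. */
    public class FeedableReader extends Reader {
      private static final String EOF = new String(); // sentinel, compared by identity
      private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
      private String current = "";
      private int pos = 0;
      private boolean done = false;

      /** Feed the next string; would be called from TupleStream::read. */
      public void feed(String s) { pending.add(s); }

      /** Signal that no more data will arrive. */
      public void endInput() { pending.add(EOF); }

      @Override
      public int read(char[] buf, int off, int len) throws IOException {
        if (done) return -1;
        while (pos >= current.length()) { // current string exhausted: block
          try {
            current = pending.take();     // waits for feed() or endInput()
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException(e);
          }
          if (current == EOF) { done = true; return -1; }
          pos = 0;
        }
        int n = Math.min(len, current.length() - pos);
        current.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
      }

      @Override
      public void close() { done = true; }
    }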
I'm not sure I'm qualified to recommend changes to the tagger. I'd maybe
change the constructor to not accept a TokenStream, but only the
configuration (reduce strategy, terms, ...), and provide a setter for the
TokenStream (patch attached).
But that implies a TokenStream is cheap to construct and use, which
I don't know.
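If that were in place, I imagine the stream driving it roughly like
this; 'ConcreteTagger' stands in for whatever Tagger subclass supplies
the tag callback, and 'analyzer', 'indexedField', 'terms', 'liveDocs'
and 'reducer' would be fields of the stream (untested sketch):

    Tagger tagger = new ConcreteTagger(terms, liveDocs, reducer,
                                       skipAltTokens, ignoreStopWords);
    for (String text : texts) {
      // Analyzer reuses its TokenStreamComponents per thread, so getting
      // a new TokenStream per string should be comparatively cheap; the
      // heavy state (the terms) stays in the Tagger.
      try (TokenStream ts = analyzer.tokenStream(indexedField, text)) {
        tagger.setTokenStream(ts); // adds the attributes and calls reset()
        tagger.process();
      }
    }

At least Analyzer reuses its components per thread, as far as I can
tell, so maybe the construction cost is a non-issue.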
> I'm not sure if something like this should be contributed back to Solr
> itself. I don't even know the bigger picture of why you are doing this,
> so I am pessimistic :-).
Which is completely fine :D
Thank you for the guidance!
best regards,
David
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
On Thu, Feb 27, 2020 at 8:01 AM David '-1' Schmid
<david.sch...@vis.uni-stuttgart.de> wrote:
Hello again!
On 25.02.20 22:39, David Smiley wrote:
> I haven't worked on streaming expressions yet but I did a little bit of
> digging around. I think the ClassifyStream might be somewhat similar to
> learn from. It takes a stream of docs, not unlike what you want. And
> crucially it implements setStreamContext with an implementation which
> demonstrates how to get access to a SolrCore. From a core, you can get
> a SolrIndexSearcher. [...]
That worked beautifully! Or let's say: I got it working; the code is not
beautiful, as is.
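In case it helps anyone reading along, the core of it boils down to
roughly this (simplified; 'searcherHolder' and 'indexedField' are fields
of my TaggerStream, and error handling plus the searcherHolder.decref()
in close() are left out):

    private SolrCore solrCore;
    private RefCounted<SolrIndexSearcher> searcherHolder;

    @Override
    public void setStreamContext(StreamContext context) {
      // same trick as ClassifyStream: the core is put into the context
      // under the "solr-core" key
      Object coreObj = context.get("solr-core");
      if (!(coreObj instanceof SolrCore)) {
        throw new IllegalStateException(
            "StreamContext must have a SolrCore in the 'solr-core' key");
      }
      this.solrCore = (SolrCore) coreObj;
    }

    @Override
    public void open() throws IOException {
      searcherHolder = solrCore.getSearcher();
      SolrIndexSearcher searcher = searcherHolder.get();
      Terms terms = searcher.getSlowAtomicReader().terms(indexedField);
      // ... build the Tagger from 'terms' and open the inner stream ...
    }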
Would this be interesting/relevant enough to be adopted upstream?
If so, should I open up a JIRA ticket?
best regards,
David
> On Fri, Feb 21, 2020 at 8:05 AM David '-1' Schmid
> <david.sch...@vis.uni-stuttgart.de> wrote:
>
> Hello dear developers!
>
> I've been wondering if I'd be able to adapt the current
> TaggerRequestHandler for use within the /stream request handler.
>
> Starting out is a tad confusing, which I expected since I have almost
> no experience with the solr/lucene codebase.
>
> My goal is as follows: I want to use the result of a previous
> select(coll1, ...) as input for adding tags to the result document.
>
> Possibly:
>
>     tag(
>       select(...), field_to_analyze_for_tags,
>       collection_with_tag_dict, tag_dict_field,
>       ... // remaining tagger configuration options
>     )
>
> I'm currently stuck writing a
> 'public class TaggerStream extends TupleStream implements Expressible'
> at two points:
>
> == Problem 1: Getting 'terms' ==
>
> The TaggerRequestHandler gets a SolrIndexSearcher via the request
>
> > final SolrIndexSearcher searcher = req.getSearcher();
>
> which in turn is used to acquire the terms
>
> > Terms terms = searcher.getSlowAtomicReader().terms(indexedField);
>
> which are used for tagging.
>
> I've tried finding something that will yield the equivalent, but as you
> might have guessed: I didn't find anything so far.
>
>
> == Problem 2: Multiple Shards ==
>
> I guess this might come up sooner or later; hence this is related to
> SOLR-14190 (requesting the tagger to work across multiple shards).
> I suspect (mind: I really don't know) that acquiring the terms will
> have something to do with that, at least when we need to merge the
> results from multiple shards, but I have not yet found any code that
> does that. Might have been blinded by my confusion, though.
>
>
> I'd be thankful if someone could help with any pointers regarding this.
>
> best regards,
> David
>
>
diff --git a/solr/core/src/java/org/apache/solr/handler/tagger/Tagger.java b/solr/core/src/java/org/apache/solr/handler/tagger/Tagger.java
index 12a4cf0a035..cdfecc52ffb 100644
--- a/solr/core/src/java/org/apache/solr/handler/tagger/Tagger.java
+++ b/solr/core/src/java/org/apache/solr/handler/tagger/Tagger.java
@@ -47,11 +47,11 @@ import org.slf4j.LoggerFactory;
 public abstract class Tagger {
   private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
-  private final TokenStream tokenStream;
-  private final TermToBytesRefAttribute byteRefAtt;
-  private final PositionIncrementAttribute posIncAtt;
-  private final OffsetAttribute offsetAtt;
-  private final TaggingAttribute taggingAtt;
+  private TokenStream tokenStream;
+  private TermToBytesRefAttribute byteRefAtt;
+  private PositionIncrementAttribute posIncAtt;
+  private OffsetAttribute offsetAtt;
+  private TaggingAttribute taggingAtt;
 
   private final TagClusterReducer tagClusterReducer;
   private final Terms terms;
@@ -81,6 +81,27 @@ public abstract class Tagger {
     this.tagClusterReducer = tagClusterReducer;
   }
 
+  public Tagger(Terms terms, Bits liveDocs,
+                TagClusterReducer tagClusterReducer, boolean skipAltTokens,
+                boolean ignoreStopWords) throws IOException {
+    this.terms = terms;
+    this.liveDocs = liveDocs;
+    this.skipAltTokens = skipAltTokens;
+    this.ignoreStopWords = ignoreStopWords;
+
+    this.tagClusterReducer = tagClusterReducer;
+  }
+
+  public void setTokenStream(TokenStream tokenStream) throws IOException {
+    this.tokenStream = tokenStream;
+
+    byteRefAtt = tokenStream.addAttribute(TermToBytesRefAttribute.class);
+    posIncAtt = tokenStream.addAttribute(PositionIncrementAttribute.class);
+    offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
+    taggingAtt = tokenStream.addAttribute(TaggingAttribute.class);
+    tokenStream.reset();
+  }
+
   public void enableDocIdsCache(int initSize) {
     if (initSize > 0)
       docIdsCache = new HashMap<>(initSize);
@@ -89,6 +110,10 @@ public abstract class Tagger {
   public void process() throws IOException {
     if (terms == null)
       return;
+    if (null == tokenStream) {
+      // throw IllegalStateException instead?
+      return;
+    }
 
     //a shared pointer to the head used by this method and each Tag instance.
     final TagLL[] head = new TagLL[1];
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org