On 27.02.20 19:01, David Smiley wrote:
> I'm glad you got it working! It's sad you felt the need to copy-paste
> the tagger; perhaps you can recommend changes to make it more extensible
> so that you or others needn't fork it.
No need to feel sad; as I mentioned, it's quick and dirty, and I didn't
know better.
I was wondering how to feed multiple Strings into the tagger without
creating new instances of everything, but since I don't know much about
how the tokenizers work, I just slapped everything together.
I had planned to maybe use an InputStream that blocks once one string is
exhausted, so I could feed the tags back into the stream and hand the
InputStream new data once TupleStream::read is called again.
But since I wanted to get this done quickly... yeah. That happened.
I'm not happy with it, but I learned a lot.
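Just to sketch what I had in mind (names invented, never tried; and note
that Lucene tokenizers actually consume a java.io.Reader rather than an
InputStream, so it would have been a blocking Reader):

    import java.io.IOException;
    import java.io.Reader;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    /** A Reader that blocks once the current string is exhausted,
     *  until more data is fed in or end-of-input is signalled. */
    public class FeedableReader extends Reader {
      private static final String EOF = new String(); // sentinel, compared by identity
      private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
      private String current = "";
      private int pos = 0;
      private boolean done = false;

      /** Feed the next string; would be called from TupleStream::read. */
      public void feed(String s) { pending.add(s); }

      /** Signal that no more data will arrive. */
      public void endInput() { pending.add(EOF); }

      @Override
      public int read(char[] buf, int off, int len) throws IOException {
        if (done) return -1;
        while (pos >= current.length()) { // current string exhausted: block
          try {
            current = pending.take();     // waits for feed() or endInput()
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException(e);
          }
          if (current == EOF) { done = true; return -1; }
          pos = 0;
        }
        int n = Math.min(len, current.length() - pos);
        current.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
      }

      @Override
      public void close() { done = true; }
    }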
I'm not sure I'm qualified to recommend changes to the tagger. I'd maybe
change the constructor to not accept a TokenStream, but only the
configuration (reduce strategy, terms, ...), and provide a setter for the
TokenStream (patch attached).
But that implies a TokenStream is cheap to construct and use, which
I don't know.
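If that were in place, I imagine the stream driving it roughly like
this; 'ConcreteTagger' stands in for whatever Tagger subclass supplies
the tag callback, and 'analyzer', 'indexedField', 'terms', 'liveDocs'
and 'reducer' would be fields of the stream (untested sketch):

    Tagger tagger = new ConcreteTagger(terms, liveDocs, reducer,
                                       skipAltTokens, ignoreStopWords);
    for (String text : texts) {
      // Analyzer reuses its TokenStreamComponents per thread, so getting
      // a new TokenStream per string should be comparatively cheap; the
      // heavy state (the terms) stays in the Tagger.
      try (TokenStream ts = analyzer.tokenStream(indexedField, text)) {
        tagger.setTokenStream(ts); // adds the attributes and calls reset()
        tagger.process();
      }
    }

At least Analyzer reuses its components per thread, as far as I can
tell, so maybe the construction cost is a non-issue.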
> I'm not sure if something like this should be contributed back to Solr
> itself. I don't even know the bigger picture of why you are doing this,
> so I am pessimistic :-).
Which is completely fine :D
Thank you for the guidance!
best regards,
David
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
On Thu, Feb 27, 2020 at 8:01 AM David '-1' Schmid
<david.sch...@vis.uni-stuttgart.de> wrote:
Hello again!
On 25.02.20 22:39, David Smiley wrote:
> I haven't worked on streaming expressions yet but I did a little bit of
> digging around. I think the ClassifyStream might be somewhat similar to
> learn from. It takes a stream of docs, not unlike what you want. And
> crucially it implements setStreamContext with an implementation which
> demonstrates how to get access to a SolrCore. From a core, you can get
> a SolrIndexSearcher. [...]
That worked beautifully! Or let's say: I got it working; the code is not
beautiful, as is.
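In case it helps anyone reading along, the core of it boils down to
roughly this (simplified; 'searcherHolder' and 'indexedField' are fields
of my TaggerStream, and error handling plus the searcherHolder.decref()
in close() are left out):

    private SolrCore solrCore;
    private RefCounted<SolrIndexSearcher> searcherHolder;

    @Override
    public void setStreamContext(StreamContext context) {
      // same trick as ClassifyStream: the core is put into the context
      // under the "solr-core" key
      Object coreObj = context.get("solr-core");
      if (!(coreObj instanceof SolrCore)) {
        throw new IllegalStateException(
            "StreamContext must have a SolrCore in the 'solr-core' key");
      }
      this.solrCore = (SolrCore) coreObj;
    }

    @Override
    public void open() throws IOException {
      searcherHolder = solrCore.getSearcher();
      SolrIndexSearcher searcher = searcherHolder.get();
      Terms terms = searcher.getSlowAtomicReader().terms(indexedField);
      // ... build the Tagger from 'terms' and open the inner stream ...
    }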
Would this be interesting/relevant enough to be adopted upstream?
If so, should I open up a JIRA ticket?
best regards,
David
> On Fri, Feb 21, 2020 at 8:05 AM David '-1' Schmid
> <david.sch...@vis.uni-stuttgart.de> wrote:
>
> Hello dear developers!
>
> I've been wondering if I'd be able to adapt the current
> TaggerRequestHandler for use within the /stream request handler.
>
> Starting out is a tad confusing, which I expected since I have almost
> no experience with the solr/lucene codebase.
>
> My goal is as follows: I want to use the result of a previous
> select(coll1, ...) as input for adding tags to the result document.
>
> Possibly:
>
>     tag(
>       select(...), field_to_analyze_for_tags,
>       collection_with_tag_dict, tag_dict_field,
>       ... // remaining tagger configuration options
>     )
>
> I'm currently stuck writing a
> 'public class TaggerStream extends TupleStream implements Expressible'
> at two points:
>
> == Problem 1: Getting 'terms' ==
>
> The TaggerRequestHandler gets a SolrIndexSearcher via the request
>
> > final SolrIndexSearcher searcher = req.getSearcher();
>
> which in turn is used to acquire the terms
>
> > Terms terms = searcher.getSlowAtomicReader().terms(indexedField);
>
> which are used for tagging.
>
> I've tried finding something that will yield the equivalent, but as you
> might have guessed: I didn't find anything so far.
>
>
> == Problem 2: Multiple Shards ==
>
> I guess this might come up sooner or later; hence this is related to
> SOLR-14190 (requesting the tagger to work across multiple shards).
> I suspect (mind: I really don't know) that acquiring the terms will
> have something to do with that, at least when we need to merge the
> results from multiple shards, but I have not yet found any code that
> does that. Might have been blinded by my confusion, though.
>
>
> I'd be thankful if someone could help with any pointers regarding this.
>
> best regards,
> David
>
>
diff --git a/solr/core/src/java/org/apache/solr/handler/tagger/Tagger.java b/solr/core/src/java/org/apache/solr/handler/tagger/Tagger.java
index 12a4cf0a035..cdfecc52ffb 100644
--- a/solr/core/src/java/org/apache/solr/handler/tagger/Tagger.java
+++ b/solr/core/src/java/org/apache/solr/handler/tagger/Tagger.java
@@ -47,11 +47,11 @@ import org.slf4j.LoggerFactory;
 public abstract class Tagger {
   private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
 
-  private final TokenStream tokenStream;
-  private final TermToBytesRefAttribute byteRefAtt;
-  private final PositionIncrementAttribute posIncAtt;
-  private final OffsetAttribute offsetAtt;
-  private final TaggingAttribute taggingAtt;
+  private TokenStream tokenStream;
+  private TermToBytesRefAttribute byteRefAtt;
+  private PositionIncrementAttribute posIncAtt;
+  private OffsetAttribute offsetAtt;
+  private TaggingAttribute taggingAtt;
 
   private final TagClusterReducer tagClusterReducer;
   private final Terms terms;
@@ -81,6 +81,27 @@ public abstract class Tagger {
     this.tagClusterReducer = tagClusterReducer;
   }
 
+  public Tagger(Terms terms, Bits liveDocs,
+                TagClusterReducer tagClusterReducer, boolean skipAltTokens,
+                boolean ignoreStopWords) throws IOException {
+    this.terms = terms;
+    this.liveDocs = liveDocs;
+    this.skipAltTokens = skipAltTokens;
+    this.ignoreStopWords = ignoreStopWords;
+
+    this.tagClusterReducer = tagClusterReducer;
+  }
+
+  public void setTokenStream(TokenStream tokenStream) throws IOException {
+    this.tokenStream = tokenStream;
+
+    byteRefAtt = tokenStream.addAttribute(TermToBytesRefAttribute.class);
+    posIncAtt = tokenStream.addAttribute(PositionIncrementAttribute.class);
+    offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
+    taggingAtt = tokenStream.addAttribute(TaggingAttribute.class);
+    tokenStream.reset();
+  }
+
   public void enableDocIdsCache(int initSize) {
     if (initSize > 0)
       docIdsCache = new HashMap<>(initSize);
@@ -89,6 +110,10 @@ public abstract class Tagger {
   public void process() throws IOException {
     if (terms == null)
       return;
+    if (null == tokenStream) {
+      // throw IllegalStateException instead?
+      return;
+    }
 
     //a shared pointer to the head used by this method and each Tag instance.
     final TagLL[] head = new TagLL[1];
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org