Re: Consuming token stream more than once in same filter

Michael Sokolov Thu, 31 Oct 2019 01:54:45 -0700

Are you able to:
1) create a custom attribute encoding the language,
2) create a filter that sets the attribute when it reads the first token, and
3) wrap your synonym filters (one for each language) in a
ConditionalTokenFilter that filters based on the language attribute?
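
As a rough, library-free illustration of those three steps: the sketch below is not Lucene code. Token, tagLanguage and conditionalSynonyms are hypothetical names, the "generic" fallback mirrors the quoted code, and synonyms are simplified to one-to-one replacements. In a real analysis chain I would expect the tag to be a custom Attribute, the tagger a TokenFilter, and each wrapper a ConditionalTokenFilter. The point is that the language is decided while tokens are being consumed, so the factory's create() never has to read the stream itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Library-free model of the three steps; none of these types are Lucene classes. */
final class LangSynonymSketch {

    /** Step 1: a token that carries a language tag next to its term. */
    static final class Token {
        final String term;
        final String lang;
        Token(String term, String lang) { this.term = term; this.lang = lang; }
    }

    /** Step 2: reads "lang_xxx" from the first token and tags every token with it. */
    static Iterator<Token> tagLanguage(final Iterator<String> terms, final Set<String> knownLangs) {
        return new Iterator<Token>() {
            private String lang = "generic"; // fallback when no known marker is seen
            private boolean first = true;

            @Override public boolean hasNext() { return terms.hasNext(); }

            @Override public Token next() {
                String term = terms.next();
                if (first) {
                    first = false;
                    if (term.startsWith("lang_")) {
                        String candidate = term.substring("lang_".length());
                        if (knownLangs.contains(candidate)) lang = candidate;
                    }
                }
                return new Token(term, lang);
            }
        };
    }

    /** Step 3: applies one language's synonyms only to tokens tagged with that language. */
    static Iterator<Token> conditionalSynonyms(final Iterator<Token> in, final String lang,
                                               final Map<String, String> synonyms) {
        return new Iterator<Token>() {
            @Override public boolean hasNext() { return in.hasNext(); }

            @Override public Token next() {
                Token t = in.next();
                if (!lang.equals(t.lang)) return t; // condition not met: pass through
                return new Token(synonyms.getOrDefault(t.term, t.term), t.lang);
            }
        };
    }

    /** Builds the chain once; the per-document decision happens while iterating. */
    static List<String> analyze(String text, Map<String, Map<String, String>> synonymsByLang) {
        Iterator<String> terms = Arrays.asList(text.trim().split("\\s+")).iterator();
        Iterator<Token> stream = tagLanguage(terms, synonymsByLang.keySet());
        for (Map.Entry<String, Map<String, String>> e : synonymsByLang.entrySet()) {
            stream = conditionalSynonyms(stream, e.getKey(), e.getValue());
        }
        List<String> out = new ArrayList<>();
        while (stream.hasNext()) out.add(stream.next().term);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> maps = Map.of(
                "eng", Map.of("text", "document"),
                "fra", Map.of("text", "texte"));
        // prints [lang_fra, this, is, French, language, texte]
        System.out.println(analyze("lang_fra this is French language text", maps));
    }
}
```

Note that the chain is assembled without looking at any token. Only the tagging step inspects the first token, and it does so as part of normal consumption, which is what keeps a single reset()-then-consume workflow possible in the real API.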


On Wed, Oct 30, 2019 at 11:16 PM Shyamsunder Mutcha <sjh...@gmail.com> wrote:
>
> I have a requirement to handle synonyms differently based on the first word 
> (token) in the text field of the document. I have implemented a custom 
> SynFilterFactory which loads synonyms per language when the core/Solr is started.
>
> Now, in the MySynonymFilterFactory#create(TokenStream input) method, I have to 
> read the first token from the input TokenStream. Based on that token's value, 
> the corresponding SynonymMap will be used for SynonymFilter creation.
>
> Here are my documents
> doc1 <text>lang_eng this is English language text</text>
> doc2 <text>lang_fra this is French language text</text>
> doc3 <text>lang_spa this is Spanish language text</text>
>
> MySynonymFilterFactory creates MySynonymFilter. The logic of its create() 
> method is below:
>
> @Override
> public TokenStream create(TokenStream input) {
>   // if the fst is null, it means there's actually no synonyms... just return the
>   // original stream as there is nothing to do here.
>   // return map.fst == null ? input : new MySynonymFilter(input, map, ignoreCase);
>   System.out.println("input=" + input);
>
>   // somehow read the TokenStream here to capture the lang value
>   SynonymMap synonyms = null;
>   try {
>     CharTermAttribute termAtt = input.addAttribute(CharTermAttribute.class);
>     boolean first = false;
>     input.reset();
>     while (!first && input.incrementToken()) {
>       String term = new String(termAtt.buffer(), 0, termAtt.length());
>       System.out.println("termAtt=" + term);
>       if (StringUtils.startsWith(term, "lang_")) {
>         String[] split = StringUtils.split(term, "_");
>         String lang = split[1];
>         String key = (langSynMap.containsKey(lang)) ? lang : "generic";
>         synonyms = langSynMap.get(key);
>         System.out.println("synonyms=" + synonyms);
>       }
>       first = true;
>     }
>   } catch (IOException e) {
>     // TODO Auto-generated catch block
>     e.printStackTrace();
>   }
>
>   return synonyms == null ? input : new SynonymFilter(input, synonyms, ignoreCase);
> }
>
>
> This code compiles, and the new analysis works fine in the Solr admin 
> analysis screen. But it fails with the exception below when I try to index a 
> document:
> 30273 ERROR (qtp1689843956-18) [   x:gcom] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id id1 to the index; possible analysis error.
>         at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:180)
>         at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)
>         at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:934)
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1089)
>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:712)
>         at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
> Caused by: java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
>         at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:109)
>         at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:527)
>         at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:738)
>         at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:159)
>         at com.synonyms.poc.synpoc.MySynonymFilterFactory.create(MySynonymFilterFactory.java:94)
>         at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:91)
>         at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
>         at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
>         at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
>         at org.apache.lucene.document.Field.tokenStream(Field.java:562)
>         at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:628)
>         at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
>         at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
>         at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
>         at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
>         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
>         at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:282)
>         at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:214)
>         at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
>         ... 37 more
>
> Any idea how I can read a token stream without violating the token stream 
> contract? I found a similar discussion here: 
> https://lucene.472066.n3.nabble.com/how-to-reuse-a-tokenStream-td850767.html 
> but it doesn't help solve my problem.
>
> Also, how come the same error is not reported when analyzing the field value 
> using the Solr admin console analysis screen?
>
> Thanks

