[ 
https://issues.apache.org/jira/browse/LUCENE-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066833#comment-13066833
 ] 

Robert Muir commented on LUCENE-3326:
-------------------------------------

The logic of this class is broken: the field parameter taken here only 
specifies the field name passed to the Analyzer when analyzing the tokens 
(e.g. some analyzers behave differently depending on the field).

There should be no loop across fields. Instead, you should be forced to 
provide this field name as an argument if you pass in a reader (analyze this 
content with my analyzer using field X).
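
A minimal sketch of that shape, using only the JDK: the caller supplies the 
reader together with exactly one field name, and the reader is consumed once. 
FieldAwareTokenizer and termFrequencies are hypothetical names invented for 
this illustration; they stand in for the Analyzer and are not Lucene API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class SingleFieldTermCounter {

    /** Hypothetical stand-in for an analyzer whose behaviour may depend on the field name. */
    interface FieldAwareTokenizer {
        String[] tokenize(String fieldName, String text);
    }

    /** Consumes the reader exactly once, analyzing its content as the given field. */
    static Map<String, Integer> termFrequencies(Reader reader, String fieldName,
                                                FieldAwareTokenizer tokenizer) {
        try (Reader r = reader) {
            // Read the whole content; the reader is closed when this block exits.
            StringBuilder text = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
            Map<String, Integer> freqs = new HashMap<>();
            for (String term : tokenizer.tokenize(fieldName, text.toString())) {
                if (!term.isEmpty()) {
                    freqs.merge(term, 1, Integer::sum);
                }
            }
            return freqs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed behaviour for the demo: "title" keeps case, other fields are lowercased.
        FieldAwareTokenizer tokenizer = (field, text) ->
                "title".equals(field) ? text.split("\\s+") : text.toLowerCase().split("\\s+");
        Map<String, Integer> freqs =
                termFrequencies(new StringReader("Lucene lucene search"), "body", tokenizer);
        System.out.println(freqs);
    }
}
```

Because there is a single field name per call, there is no loop in which the 
same reader could be handed out a second time.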

As far as I can tell, this has been broken for a long time: if your analyzer 
works the same way across all fields, you would previously never have noticed 
a problem, because it would analyze the text with the "first" field name but 
didn't close the reader, then pass the exhausted reader on to the other field 
names.
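
The two failure modes can be reproduced with plain java.io, without Lucene. 
In this sketch (class and method names are made up for the illustration), a 
second pass over an exhausted but still open reader silently yields nothing, 
while a second pass over a closed reader throws:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderReuseDemo {

    /** Reads everything left in the reader and returns the number of chars read. */
    static int drain(Reader r) throws IOException {
        int count = 0;
        while (r.read() != -1) {
            count++;
        }
        return count;
    }

    /** Describes what a second pass over the same reader sees. */
    static String secondPass(boolean closeAfterFirstPass) {
        Reader r = new StringReader("some text");
        try {
            drain(r);                       // first "field" consumes everything
            if (closeAfterFirstPass) {
                r.close();                  // what closing the token stream does
            }
            return "read " + drain(r) + " chars";   // second "field"
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Old behaviour: reader left open, second pass sees no content at all.
        System.out.println(secondPass(false));  // prints "read 0 chars"
        // Current behaviour: reader closed, second pass fails with an IOException.
        System.out.println(secondPass(true));
    }
}
```

The first case is why the problem went unnoticed: the later fields simply 
contribute no terms.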

> MoreLikeThis reuses a reader after it has already closed it
> -----------------------------------------------------------
>
>                 Key: LUCENE-3326
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3326
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/other
>    Affects Versions: 3.3
>            Reporter: Trejkaz
>
> MoreLikeThis has a fatal bug whereby it tries to reuse a reader for multiple 
> fields:
> {code}
>     Map<String,Int> words = new HashMap<String,Int>();
>     for (int i = 0; i < fieldNames.length; i++) {
>         String fieldName = fieldNames[i];
>         addTermFrequencies(r, words, fieldName);
>     }
> {code}
> However, addTermFrequencies() is creating a TokenStream for this reader:
> {code}
>     TokenStream ts = analyzer.reusableTokenStream(fieldName, r);
>     int tokenCount=0;
>     // for every token
>     CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
>     ts.reset();
>     while (ts.incrementToken()) {
>         /* body omitted */
>     }
>     ts.end();
>     ts.close();
> {code}
> When it closes this token stream, it closes the underlying reader.  Then the 
> second time around the loop, you get:
> {noformat}
> Caused by: java.io.IOException: Stream closed
>       at sun.nio.cs.StreamDecoder.ensureOpen(StreamDecoder.java:27)
>       at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:128)
>       at java.io.InputStreamReader.read(InputStreamReader.java:167)
>       at com.acme.util.CompositeReader.read(CompositeReader.java:101)
>       at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:803)
>       at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1010)
>       at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:178)
>       at org.apache.lucene.analysis.standard.StandardFilter.incrementTokenClassic(StandardFilter.java:61)
>       at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:57)
>       at com.acme.storage.index.analyser.NormaliseFilter.incrementToken(NormaliseFilter.java:51)
>       at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
>       at org.apache.lucene.search.similar.MoreLikeThis.addTermFrequencies(MoreLikeThis.java:931)
>       at org.apache.lucene.search.similar.MoreLikeThis.retrieveTerms(MoreLikeThis.java:1003)
>       at org.apache.lucene.search.similar.MoreLikeThis.retrieveInterestingTerms(MoreLikeThis.java:1036)
> {noformat}
> My first thought was that it seems like a "ReaderFactory" of sorts should be 
> passed in so that a new Reader can be created for the second field (maybe the 
> factory could be passed the field name, so that if someone wanted to pass a 
> different reader to each, they could.)
> Interestingly, the methods taking File and URL exhibit the same issue.  I'm 
> not sure what to do about those (and we're not using them.)  The method 
> taking File could open the file twice, but the method taking a URL probably 
> shouldn't fetch the same URL twice.
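
For reference, the "ReaderFactory" idea from the description could look 
roughly like the sketch below, which models the factory as a 
java.util.function.Function from field name to a fresh Reader. This only 
illustrates the reporter's suggestion; the class and method names are 
hypothetical, and it is not an existing MoreLikeThis API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ReaderFactoryDemo {

    /** Counts chars per field, asking the factory for a fresh reader each time. */
    static Map<String, Integer> charsPerField(String[] fieldNames,
                                              Function<String, Reader> readerFactory) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String fieldName : fieldNames) {
            // Each iteration owns its reader and closes it, so nothing is reused.
            try (Reader r = readerFactory.apply(fieldName)) {
                int n = 0;
                while (r.read() != -1) {
                    n++;
                }
                counts.put(fieldName, n);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = charsPerField(
                new String[] {"title", "body"},
                fieldName -> new StringReader("some text"));
        System.out.println(counts);  // prints {title=9, body=9}
    }
}
```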
