[jira] [Commented] (LUCENE-3326) MoreLikeThis reuses a reader after it has already closed it
[ https://issues.apache.org/jira/browse/LUCENE-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072342#comment-13072342 ]

Carl Austin commented on LUCENE-3326:
-------------------------------------

In case you were unaware (the JIRA says it affects 3.3): this also affects 3.2, as I have just reproduced it there. Thanks.

MoreLikeThis reuses a reader after it has already closed it
-----------------------------------------------------------

                Key: LUCENE-3326
                URL: https://issues.apache.org/jira/browse/LUCENE-3326
            Project: Lucene - Java
         Issue Type: Bug
         Components: modules/other
   Affects Versions: 3.3
           Reporter: Trejkaz
            Fix For: 3.4, 4.0
        Attachments: LUCENE-3326.patch

MoreLikeThis has a fatal bug whereby it tries to reuse a reader for multiple fields:

{code}
Map<String, Int> words = new HashMap<String, Int>();
for (int i = 0; i < fieldNames.length; i++) {
  String fieldName = fieldNames[i];
  addTermFrequencies(r, words, fieldName);
}
{code}

However, addTermFrequencies() creates a TokenStream for this reader:

{code}
TokenStream ts = analyzer.reusableTokenStream(fieldName, r);
int tokenCount = 0;
// for every token
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
  /* body omitted */
}
ts.end();
ts.close();
{code}

When it closes this token stream, it also closes the underlying reader.
Then the second time around the loop, you get:

{noformat}
Caused by: java.io.IOException: Stream closed
	at sun.nio.cs.StreamDecoder.ensureOpen(StreamDecoder.java:27)
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:128)
	at java.io.InputStreamReader.read(InputStreamReader.java:167)
	at com.acme.util.CompositeReader.read(CompositeReader.java:101)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:803)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1010)
	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:178)
	at org.apache.lucene.analysis.standard.StandardFilter.incrementTokenClassic(StandardFilter.java:61)
	at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:57)
	at com.acme.storage.index.analyser.NormaliseFilter.incrementToken(NormaliseFilter.java:51)
	at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
	at org.apache.lucene.search.similar.MoreLikeThis.addTermFrequencies(MoreLikeThis.java:931)
	at org.apache.lucene.search.similar.MoreLikeThis.retrieveTerms(MoreLikeThis.java:1003)
	at org.apache.lucene.search.similar.MoreLikeThis.retrieveInterestingTerms(MoreLikeThis.java:1036)
{noformat}

My first thought was that a ReaderFactory of sorts should be passed in, so that a new Reader can be created for the second field (the factory could perhaps be passed the field name, so that someone wanting to pass a different reader to each field could do so).

Interestingly, the methods taking File and URL exhibit the same issue. I'm not sure what to do about those (and we're not using them). The method taking File could open the file twice, but the method taking a URL probably shouldn't fetch the same URL twice.

--
This message is automatically generated by JIRA.
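The ReaderFactory idea suggested above can be sketched in a few lines. This is a hypothetical illustration, not from any Lucene patch: `ReaderFactory`, `PerFieldReaderSketch`, and `charCounts` are invented names, and character counting stands in for the real analyzer loop. The point is that each field gets its own Reader, so closing it (as ts.close() does) cannot break the next iteration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical factory: hands out a fresh Reader per field, so closing one
// never invalidates the next field's input.
interface ReaderFactory {
    Reader newReader(String fieldName);
}

class PerFieldReaderSketch {
    // Stand-in for the addTermFrequencies loop: reads each field's content
    // from its own Reader, counting characters instead of analyzed tokens.
    static Map<String, Integer> charCounts(String[] fieldNames, ReaderFactory factory) {
        Map<String, Integer> counts = new HashMap<>();
        for (String fieldName : fieldNames) {
            // fresh reader each iteration; try-with-resources closes it,
            // mirroring what ts.close() does to the shared reader today
            try (Reader r = factory.newReader(fieldName)) {
                while (r.read() != -1) {
                    counts.merge(fieldName, 1, Integer::sum);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = charCounts(
                new String[] {"title", "body"},
                fieldName -> new StringReader("hello")); // same text for both fields here
        // both fields were read fully; no "Stream closed" on the second one
        System.out.println(counts.get("title") + " " + counts.get("body")); // prints "5 5"
    }
}
```

Passing the field name to the factory also covers the case mentioned above where a caller wants a different reader per field.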
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085734#comment-13085734 ]

Carl Austin commented on LUCENE-1720:
-------------------------------------

Lucene 3.3 changed scoring in TermQuery (and others?), which means that TimeLimitedIndexReader breaks query scoring, returning all scores as 0. This is because hash codes for the sub-readers are stored and reused. Unfortunately, getSequentialSubReaders uses a new wrapping object each time and doesn't implement hashCode in any way, hence returning identity hash codes (object IDs). In my case I just implemented hashCode in the TimeLimitedIndexReader, returning in.hashCode(), and this fixed the problem.

TimeLimitedIndexReader and associated utility class
---------------------------------------------------

                Key: LUCENE-1720
                URL: https://issues.apache.org/jira/browse/LUCENE-1720
            Project: Lucene - Java
         Issue Type: New Feature
         Components: core/index
           Reporter: Mark Harwood
           Assignee: Mark Harwood
           Priority: Minor
        Attachments: ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimedOutException.java, LUCENE-1720.patch, LUCENE-1720.patch, LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java

An alternative to TimeLimitedCollector that has the following advantages:
1) Any reader activity can be time-limited, rather than just single searches, e.g. the document retrieve phase.
2) Times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last collect stage of query processing).

Uses a new utility timeout class that is independent of IndexReader. The initial contribution includes a performance test class, but I have not had time yet to work up a formal JUnit test. TimeLimitedIndexReader is coded as JDK 1.5 but can easily be undone.

--
This message is automatically generated by JIRA.
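The hashCode fix described in the comment amounts to delegation. A minimal sketch, with invented names (`Wrapped` stands in for TimeLimitedIndexReader, `in` for the wrapped IndexReader, plain Object for the sub-reader):

```java
// Delegating wrapper whose hashCode and equals forward to the wrapped
// instance, so caches keyed on a sub-reader's hash code keep hitting even
// when getSequentialSubReaders returns a brand-new wrapper each call.
class Wrapped {
    final Object in;
    Wrapped(Object in) { this.in = in; }
    @Override public int hashCode() { return in.hashCode(); } // delegate, per the comment
    @Override public boolean equals(Object o) {
        return o instanceof Wrapped && in.equals(((Wrapped) o).in);
    }
}

class HashDelegationSketch {
    public static void main(String[] args) {
        Object core = new Object(); // stand-in for one sub-reader
        // Two independent wrappers around the same core now hash alike and
        // compare equal, which is what the stored-and-reused hashes require.
        Wrapped first = new Wrapped(core);
        Wrapped second = new Wrapped(core);
        System.out.println(first.hashCode() == second.hashCode() && first.equals(second)); // prints "true"
    }
}
```

Without the overrides, each wrapper would fall back to Object's identity hash, so the scores cached under the first wrapper's hash would never be found again.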
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files
[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407001#comment-13407001 ]

Carl Austin commented on LUCENE-4190:
-------------------------------------

I was the original commenter on the blog about this issue and have previously experienced the deletion of all files on a drive because of this exact behaviour; the fallout from it is massive. The issue here is that many people who use Lucene will not realise this can happen, and the situation will occur sooner or later. You can't expect every developer who uses Lucene to understand every in and out, read every bit of javadoc fully, or read every release change note. Look at the number of posts to the mailing list from people who haven't fully read or understood something. I firmly believe this has to be handled by the library, so that a simple mistake or misunderstanding by a developer does not lead to the loss of important files.

IndexWriter deletes non-Lucene files
------------------------------------

                Key: LUCENE-4190
                URL: https://issues.apache.org/jira/browse/LUCENE-4190
            Project: Lucene - Java
         Issue Type: Bug
           Reporter: Michael McCandless
           Assignee: Robert Muir
            Fix For: 4.0, 5.0
        Attachments: LUCENE-4190.patch, LUCENE-4190.patch

Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog post: http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html

IndexWriter will now (as of 4.0) delete all foreign files from the index directory. We made this change because Codecs are free to write to any files now, so the space of filenames is hard to bound. But if the user accidentally uses the wrong directory (e.g. c:/) then we will in fact delete important stuff.

I think we can at least use some simple criteria (must start with _, maybe must fit a certain pattern, e.g. _base36(_X).Y), so we are much less likely to delete a non-Lucene file.

--
This message is automatically generated by JIRA.
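The "simple criteria" idea from the issue description can be sketched as a conservative filename check. The regex below is an assumption for illustration only (roughly "_base36(_X).Y plus segments files"), not Lucene's actual pattern, and `LuceneFileNameCheck` is an invented name:

```java
import java.util.regex.Pattern;

// Only treat a file as Lucene-owned (and thus deletable by IndexWriter) if
// its name matches a conservative pattern; everything else is left alone.
class LuceneFileNameCheck {
    // _<base36>[_<suffix>].<ext>, or segments / segments_N / segments.gen
    static final Pattern LUCENE_LIKE = Pattern.compile(
            "_[a-z0-9]+(_[a-z0-9]+)?\\..+|segments(_[a-z0-9]+)?|segments\\.gen");

    static boolean looksLikeLuceneFile(String name) {
        return LUCENE_LIKE.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeLuceneFile("_0.cfs"));      // prints "true"
        System.out.println(looksLikeLuceneFile("segments_2"));  // prints "true"
        System.out.println(looksLikeLuceneFile("thesis.docx")); // prints "false": survives an accidental c:/ index dir
    }
}
```

Such a filter would not stop a codec from choosing an unanticipated name, but it makes the accidental-wrong-directory case far less destructive.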
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733234#action_12733234 ]

Carl Austin commented on LUCENE-1690:
-------------------------------------

The cache used for this is a HashMap, which is unbounded. Perhaps this should be an LRU cache with a settable maximum number of entries, to stop it growing forever if you run a lot of "more like this" queries on large indexes with many unique terms. Otherwise a nice addition; it has sped up my MoreLikeThis queries a bit.

Morelikethis queries are very slow compared to other search types
-----------------------------------------------------------------

                Key: LUCENE-1690
                URL: https://issues.apache.org/jira/browse/LUCENE-1690
            Project: Lucene - Java
         Issue Type: Improvement
         Components: contrib/*
   Affects Versions: 2.4.1
           Reporter: Richard Marr
           Priority: Minor
        Attachments: LUCENE-1690.patch
  Original Estimate: 2h
 Remaining Estimate: 2h

The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code; I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
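The bounded LRU cache suggested in the comment is a few lines in plain Java via LinkedHashMap's access-order mode. A sketch with illustrative names (`TermStatsLru` is not from the attached patch):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU cache: LinkedHashMap in access order evicts the least
// recently used entry once the settable maximum is exceeded.
class TermStatsLru<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    TermStatsLru(int maxEntries) {
        super(16, 0.75f, true); // true = access order, which gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict once past the cap
    }

    public static void main(String[] args) {
        TermStatsLru<String, Integer> cache = new TermStatsLru<>(2);
        cache.put("lucene", 10);
        cache.put("search", 20);
        cache.get("lucene");    // touch: "lucene" is now most recently used
        cache.put("index", 30); // evicts "search", the least recently used
        System.out.println(cache.keySet()); // prints "[lucene, index]"
    }
}
```

Swapping this in for the unbounded HashMap keeps the speed-up while capping memory on large indexes with many unique terms.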
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733238#action_12733238 ]

Carl Austin commented on LUCENE-1690:
-------------------------------------

I wasn't all that scientific, I'm afraid; I just noted that, once warmed up, it improved performance enough to keep using it. Sorry. However, after just 3 or 4 more-like-this queries I am seeing a definite improvement, as the majority of the free text is standard vocabulary, and unique terms only make up a small amount of the rest of the text.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737107#action_12737107 ]

Carl Austin commented on LUCENE-1690:
-------------------------------------

The cache in TermInfosReader is for everything, as you say. I do a lot of work with terms, and those terms will get pushed out of that LRU cache very quickly. I have a separate cache on my version of the MLT. This has the advantage that its entries are only pushed out by other MLT queries, and not by everything else I am doing that is not MLT-related. A lot of MLTs use the same terms, and I have a good-sized cache for it, meaning most terms I use in MLT can be retrieved from there. Seeing as MLT in my circumstances is one of the slower bits, this can give me a good advantage.

Morelikethis queries are very slow compared to other search types
-----------------------------------------------------------------

                Key: LUCENE-1690
                URL: https://issues.apache.org/jira/browse/LUCENE-1690
            Project: Lucene - Java
         Issue Type: Improvement
         Components: contrib/*
   Affects Versions: 2.4.1
           Reporter: Richard Marr
           Priority: Minor
        Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch
  Original Estimate: 2h
 Remaining Estimate: 2h

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.