[jira] [Commented] (LUCENE-3326) MoreLikeThis reuses a reader after it has already closed it

2011-07-28 Thread Carl Austin (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072342#comment-13072342 ]

Carl Austin commented on LUCENE-3326:
-

In case you were unaware (the JIRA issue lists only 3.3 under Affects 
Versions), this also affects 3.2; I have just reproduced it there.
Thanks.

 MoreLikeThis reuses a reader after it has already closed it
 ---

 Key: LUCENE-3326
 URL: https://issues.apache.org/jira/browse/LUCENE-3326
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/other
Affects Versions: 3.3
Reporter: Trejkaz
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3326.patch


 MoreLikeThis has a fatal bug whereby it tries to reuse a reader for multiple 
 fields:
 {code}
 Map<String,Int> words = new HashMap<String,Int>();
 for (int i = 0; i < fieldNames.length; i++) {
     String fieldName = fieldNames[i];
     addTermFrequencies(r, words, fieldName);
 }
 {code}
 However, addTermFrequencies() is creating a TokenStream for this reader:
 {code}
 TokenStream ts = analyzer.reusableTokenStream(fieldName, r);
 int tokenCount=0;
 // for every token
 CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
 ts.reset();
 while (ts.incrementToken()) {
 /* body omitted */
 }
 ts.end();
 ts.close();
 {code}
 When it closes this TokenStream, it also closes the underlying reader.  Then, 
 the second time around the loop, you get:
 {noformat}
 Caused by: java.io.IOException: Stream closed
   at sun.nio.cs.StreamDecoder.ensureOpen(StreamDecoder.java:27)
   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:128)
   at java.io.InputStreamReader.read(InputStreamReader.java:167)
   at com.acme.util.CompositeReader.read(CompositeReader.java:101)
   at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:803)
   at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1010)
   at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:178)
   at org.apache.lucene.analysis.standard.StandardFilter.incrementTokenClassic(StandardFilter.java:61)
   at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:57)
   at com.acme.storage.index.analyser.NormaliseFilter.incrementToken(NormaliseFilter.java:51)
   at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
   at org.apache.lucene.search.similar.MoreLikeThis.addTermFrequencies(MoreLikeThis.java:931)
   at org.apache.lucene.search.similar.MoreLikeThis.retrieveTerms(MoreLikeThis.java:1003)
   at org.apache.lucene.search.similar.MoreLikeThis.retrieveInterestingTerms(MoreLikeThis.java:1036)
 {noformat}
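
 The failure mode can be reproduced without Lucene. The following is only a 
 minimal sketch: ReaderReuseDemo, consumeAndClose and reuseAcrossFields are 
 hypothetical names, and a plain StringReader stands in for the reader that 
 MoreLikeThis receives. The consumer closes the reader when it is done, just as 
 closing the TokenStream does, so the second pass over the same reader fails.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderReuseDemo {

    /** Reads the reader to exhaustion and then closes it, like a consumer that owns the stream. */
    static int consumeAndClose(Reader r) throws IOException {
        int count = 0;
        try {
            while (r.read() != -1) {
                count++;
            }
        } finally {
            r.close();
        }
        return count;
    }

    /**
     * Hands the SAME reader to the consumer once per field.
     * Returns the IOException message of the first failure, or null if every pass succeeded.
     */
    static String reuseAcrossFields(Reader r, String[] fieldNames) {
        try {
            for (int i = 0; i < fieldNames.length; i++) {
                consumeAndClose(r); // the second iteration sees an already closed reader
            }
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String failure = reuseAcrossFields(new StringReader("some text"),
                                           new String[] {"title", "body"});
        System.out.println("failure on reuse: " + failure);
    }
}
```

 With a single field name the loop completes normally, which is presumably why 
 the problem only shows up when more than one field is configured.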
 My first thought was that it seems like a ReaderFactory of sorts should be 
 passed in so that a new Reader can be created for the second field (maybe the 
 factory could be passed the field name, so that if someone wanted to pass a 
 different reader to each, they could.)
 Interestingly, the methods taking File and URL exhibit the same issue.  I'm 
 not sure what to do about those (and we're not using them.)  The method 
 taking File could open the file twice, but the method taking a URL probably 
 shouldn't fetch the same URL twice.
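
 One way to picture the ReaderFactory idea is sketched below. This is an 
 illustration only, not the attached patch and not an existing Lucene API: 
 the ReaderFactory interface, its open(String) method and the 
 PerFieldReaderDemo class are all assumptions made for the example. The point 
 is simply that each field gets its own freshly opened reader, so closing one 
 cannot affect the next.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical interface: supplies a new Reader for each field that is analysed. */
interface ReaderFactory {
    Reader open(String fieldName) throws IOException;
}

class PerFieldReaderDemo {

    /** Counts the characters read for each field, opening and closing one reader per field. */
    static Map<String, Integer> countCharsPerField(ReaderFactory factory, String[] fieldNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String fieldName : fieldNames) {
            // try-with-resources closes this field's reader; the next field gets a new one
            try (Reader r = factory.open(fieldName)) {
                int n = 0;
                while (r.read() != -1) {
                    n++;
                }
                counts.put(fieldName, n);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The factory receives the field name, so it could return different content per field.
        ReaderFactory factory = fieldName -> new StringReader("hello");
        System.out.println(countCharsPerField(factory, new String[] {"title", "body"}));
    }
}
```

 Passing the field name to the factory keeps the option open of supplying a 
 different reader per field, as suggested above.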

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2011-08-16 Thread Carl Austin (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085734#comment-13085734 ]

Carl Austin commented on LUCENE-1720:
-

Lucene 3.3 changed scoring in TermQuery (and possibly other queries), and as a 
result TimeLimitedIndexReader breaks query scoring: every score comes back as 
0. This is because hash codes for the subreaders are stored and reused. 
Unfortunately, getSequentialSubReaders() creates a new wrapping object on each 
call, and the wrapper does not override hashCode(), so it returns the default 
identity hash code (the object ID). In my case I simply overrode hashCode() in 
TimeLimitedIndexReader to return in.hashCode(), and this fixed the problem.
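
The effect is easy to show outside Lucene. The sketch below uses hypothetical 
classes (IdentityWrapper, DelegatingWrapper, WrapperHashDemo) rather than the 
real TimeLimitedIndexReader, and a plain HashMap as a stand-in for whatever 
Lucene keys on the subreader. It also overrides equals(), which is my own 
addition so that the map lookup in the example works; the fix described above 
only mentions hashCode().

```java
import java.util.HashMap;
import java.util.Map;

/** Wrapper that inherits Object.hashCode(): every new instance looks like a new key. */
class IdentityWrapper {
    final Object in;

    IdentityWrapper(Object in) {
        this.in = in;
    }
}

/** Wrapper that delegates hashCode() (and equals()) to the wrapped object. */
class DelegatingWrapper {
    final Object in;

    DelegatingWrapper(Object in) {
        this.in = in;
    }

    @Override
    public int hashCode() {
        return in.hashCode();
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof DelegatingWrapper && in.equals(((DelegatingWrapper) other).in);
    }
}

class WrapperHashDemo {

    /** Stores a value under one wrapper and looks it up with a second wrapper of the same object. */
    static boolean lookupSucceeds(boolean delegating) {
        Object reader = new Object(); // stands in for the wrapped subreader
        Object first = delegating ? new DelegatingWrapper(reader) : new IdentityWrapper(reader);
        Object second = delegating ? new DelegatingWrapper(reader) : new IdentityWrapper(reader);

        Map<Object, String> cache = new HashMap<>();
        cache.put(first, "cached value");
        return cache.containsKey(second);
    }

    public static void main(String[] args) {
        System.out.println("identity wrapper found:   " + lookupSucceeds(false));
        System.out.println("delegating wrapper found: " + lookupSucceeds(true));
    }
}
```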

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 ActivityTimeMonitor.java, ActivityTimedOutException.java, LUCENE-1720.patch, 
 LUCENE-1720.patch, LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 The initial contribution includes a performance test class, but I have not 
 yet had time to work up a formal JUnit test.
 TimeLimitedIndexReader is coded for JDK 1.5, but this can easily be undone.
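
 As a rough illustration of a timeout utility that is independent of 
 IndexReader, the sketch below keeps a per-thread deadline that any wrapped 
 operation can check. It does not reflect the attached ActivityTimeMonitor 
 class, whose API is not shown here; DeadlineGuard, TimeoutExceededException 
 and their methods are hypothetical names chosen for the example.

```java
/** Hypothetical unchecked exception thrown when the current thread's time budget is used up. */
class TimeoutExceededException extends RuntimeException {
    TimeoutExceededException(String message) {
        super(message);
    }
}

/** Hypothetical per-thread deadline holder; it knows nothing about index readers. */
class DeadlineGuard {
    // Deadline in System.nanoTime() units, or absent when no timed activity is running.
    private static final ThreadLocal<Long> DEADLINE = new ThreadLocal<>();

    /** Starts a timed activity on the current thread. */
    static void start(long timeoutMillis) {
        DEADLINE.set(System.nanoTime() + timeoutMillis * 1_000_000L);
    }

    /** Ends the timed activity on the current thread. */
    static void stop() {
        DEADLINE.remove();
    }

    /** Called from inside each time-limited operation, for example before every term lookup. */
    static void check() {
        Long deadline = DEADLINE.get();
        if (deadline != null && System.nanoTime() - deadline >= 0) {
            throw new TimeoutExceededException("activity exceeded its time budget");
        }
    }

    public static void main(String[] args) {
        start(60_000);
        try {
            check(); // a wrapping reader would call this in each delegated method
            System.out.println("still within budget");
        } finally {
            stop();
        }
    }
}
```

 Because the check happens inside every delegated call, a runaway operation 
 can be interrupted well before the collect stage, which is the advantage 
 claimed in point 2 above.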




[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files

2012-07-05 Thread Carl Austin (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407001#comment-13407001 ]

Carl Austin commented on LUCENE-4190:
-

I was the original commenter on the blog about this issue and have previously 
experienced the deletion of all files on a drive because of the exact same 
restriction - the fallout from this is massive.

The issue here is that many people who use Lucene will not realise that this 
can happen, and this situation will occur sooner or later. You can't expect 
every developer who uses Lucene to understand every in and out, or to read 
every bit of Javadoc and every release note in full. Look at the number of 
posts to the mailing list from people who simply haven't fully read or 
understood something. I firmly believe that this has to be handled by the 
library, so that a simple mistake or misunderstanding by a developer does not 
lead to the loss of important files.

 IndexWriter deletes non-Lucene files
 

 Key: LUCENE-4190
 URL: https://issues.apache.org/jira/browse/LUCENE-4190
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Robert Muir
 Fix For: 4.0, 5.0

 Attachments: LUCENE-4190.patch, LUCENE-4190.patch


 Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog 
 post: 
 http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
 IndexWriter will now (as of 4.0) delete all foreign files from the index 
 directory.  We made this change because Codecs are free to write to any files 
 now, so the space of filenames is hard to bound.
 But if the user accidentally uses the wrong directory (e.g. c:/) then we will 
 in fact delete important stuff.
 I think we can at least use some simple criteria (must start with _, maybe 
 must fit a certain pattern, e.g. _base36(_X).Y), so we are much less likely 
 to delete a non-Lucene file.
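
 A filename check along these lines could look like the sketch below. The 
 regular expression is only my reading of the criteria in the paragraph above 
 (leading underscore, a base-36 segment name, an optional _suffix, then a dot 
 and an extension); it is an assumption, not the pattern from the attached 
 patch, and IndexFileNameFilter and looksLikeIndexFile are hypothetical names.

```java
import java.util.regex.Pattern;

class IndexFileNameFilter {

    // Assumed shape: "_" + base-36 digits, an optional "_suffix", then "." + extension.
    private static final Pattern INDEX_FILE = Pattern.compile("_[a-z0-9]+(_.*)?\\..*");

    /** Returns true only for names that fit the assumed index-file pattern. */
    static boolean looksLikeIndexFile(String fileName) {
        return INDEX_FILE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"_0.fdt", "_a1_x.doc", "notes.txt", "boot.ini"};
        for (String name : names) {
            System.out.println(name + " -> " + looksLikeIndexFile(name));
        }
    }
}
```

 Anything that does not match would simply be left alone, so pointing the 
 writer at the wrong directory would no longer wipe unrelated files.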




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-20 Thread Carl Austin (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733234#action_12733234 ]

Carl Austin commented on LUCENE-1690:
-

The cache used for this is a HashMap, which is unbounded.  Perhaps this 
should be an LRU cache with a settable maximum number of entries, to stop it 
growing forever if you run a lot of more-like-this queries on large indexes 
with many unique terms.
Otherwise a nice addition; it has sped up my more-like-this queries a bit.
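
For reference, a bounded LRU map needs very little code on top of the JDK. 
The sketch below is not taken from any patch on this issue; BoundedLruCache 
and its maxEntries parameter are hypothetical names, and it relies only on 
java.util.LinkedHashMap in access order with removeEldestEntry overridden.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** LRU cache with a fixed maximum size; the least recently accessed entry is evicted first. */
class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order rather than insertion order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedLruCache<String, Integer> cache = new BoundedLruCache<>(2);
        cache.put("apple", 10);
        cache.put("banana", 20);
        cache.get("apple");       // touch "apple" so "banana" becomes the eldest
        cache.put("cherry", 30);  // exceeds the limit and evicts "banana"
        System.out.println(cache.keySet());
    }
}
```

This class is not thread-safe; a shared cache would need external 
synchronisation, for example via Collections.synchronizedMap.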

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.
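
 The opt-in mechanism described above might be pictured roughly as follows. 
 This is a sketch under assumptions, not the patch: CachingDocFreq is a 
 hypothetical class, and the ToIntFunction stands in for the real (and 
 comparatively expensive) term-statistics lookup against the index. The cache 
 lives exactly as long as the object that owns it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Wraps a term-statistics lookup and, if enabled, remembers results for this object's lifetime. */
class CachingDocFreq {
    private final ToIntFunction<String> lookup; // stand-in for the real index lookup
    private final Map<String, Integer> cache;   // null when caching is switched off

    CachingDocFreq(ToIntFunction<String> lookup, boolean cachingEnabled) {
        this.lookup = lookup;
        this.cache = cachingEnabled ? new HashMap<String, Integer>() : null;
    }

    int docFreq(String term) {
        if (cache == null) {
            return lookup.applyAsInt(term); // opt-out: always hit the index
        }
        Integer cached = cache.get(term);
        if (cached == null) {
            cached = lookup.applyAsInt(term); // first request for this term
            cache.put(term, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] lookups = {0};
        CachingDocFreq stats = new CachingDocFreq(term -> {
            lookups[0]++;
            return term.length(); // dummy statistic for the demo
        }, true);
        stats.docFreq("lucene");
        stats.docFreq("lucene");
        System.out.println("index lookups performed: " + lookups[0]);
    }
}
```

 Discarding the object discards the cache, so an application controls how 
 stale the statistics may become simply by how long it keeps the instance.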

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-20 Thread Carl Austin (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733238#action_12733238 ]

Carl Austin commented on LUCENE-1690:
-

I wasn't all that scientific, I am afraid; I just noted that it improved 
performance enough, once warmed up, to keep on using it. Sorry.
However, after just 3 or 4 more-like-this queries I am seeing a definite 
improvement, as the majority of the free text is standard vocabulary, and the 
unique terms make up only a small part of the rest of the text.






[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Carl Austin (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737107#action_12737107 ]

Carl Austin commented on LUCENE-1690:
-

The cache in TermInfosReader is for everything, as you say. I do a lot of 
work with terms, and those terms will get pushed out of that LRU cache very 
quickly. I have a separate cache in my version of MLT. This has the advantage 
that its terms are only pushed out by other MLT queries, and not by 
everything else I am doing that is unrelated to MLT. 
A lot of MLT queries use the same terms, and I have a good-sized cache for 
them, meaning most terms I use in MLT can be retrieved from there. Seeing as 
MLT is one of the slower parts in my circumstances, this can give me a good 
advantage.

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

