Re: PATCH: SegmentsReader/SegmentsTermEnum

Otis Gospodnetic Wed, 10 Sep 2003 07:17:28 -0700

Christoph,

Thanks for the patch and the test.
I refactored your test a bit, converted it to a JUnit-based unit test
and will commit it shortly, following it with your patch.


Thank you,
Otis

--- Christoph Goller <[EMAIL PROTECTED]> wrote:
> Hi Lucene Developers,
> 
> First, let me thank you all for this excellent piece of software
> that you created. I am using Lucene in several projects and I
> am currently also building more advanced text mining applications
> on top of it. Because of that I have spent a lot of time studying
> the Lucene sources, and I will come up with a couple of proposals
> for bug fixes in the coming days. Here is the first one:
> 
> I think I can fix a bug in SegmentsTermEnum.
> One can create a TermEnum from an IndexReader in two ways:
> 
> indexReader.terms()
> indexReader.terms(t)
> 
> If one gets a TermEnum starting at a specified term t, one does not
> have to call enum.next() before using it. The enum is valid from the
> beginning. Calling enum.next() switches to the next term. However,
> this behaviour only holds if the index consists of a single segment.
> If the index consists of several segments, term t is delivered twice:
> the first time by enum.term() right after calling indexReader.terms(t),
> the second time after calling enum.next(). Furthermore, the initial
> document frequency may be wrong (if t occurs in more than one segment).
> The problem can be fixed by calling next() in the constructor of
> SegmentsTermEnum. I attach a test that demonstrates the problem and a
> patch that fixes it.
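
For readers of the archive, the behaviour described above can be reproduced without Lucene by a small stand-alone model. This is only a sketch under stated assumptions: the names MergedTermEnumSketch, SegmentCursor and walk are hypothetical and are not Lucene classes, a java.util.PriorityQueue stands in for Lucene's internal segment merge queue, and seeking to a start term t is left out (every segment cursor simply starts at its first term). It compares copying the state of the top segment in the constructor with calling next() there, as the patch proposes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Simplified stand-in for a multi-segment term enumeration; NOT Lucene code. */
class MergedTermEnumSketch {

  /** Cursor over one segment's sorted terms and per-segment document frequencies. */
  static final class SegmentCursor implements Comparable<SegmentCursor> {
    final String[] terms;
    final int[] docFreqs;
    int pos = 0;

    SegmentCursor(String[] terms, int[] docFreqs) {
      this.terms = terms;
      this.docFreqs = docFreqs;
    }

    String term() { return terms[pos]; }
    int docFreq() { return docFreqs[pos]; }
    boolean next() { return ++pos < terms.length; }

    public int compareTo(SegmentCursor other) { return term().compareTo(other.term()); }
  }

  final PriorityQueue<SegmentCursor> queue = new PriorityQueue<SegmentCursor>();
  String term;
  int docFreq;

  MergedTermEnumSketch(List<SegmentCursor> segments, boolean advanceInConstructor) {
    queue.addAll(segments);
    if (queue.isEmpty()) return;
    if (advanceInConstructor) {
      next();                            // patched: merge every segment positioned at the first term
    } else {
      SegmentCursor top = queue.peek();  // unpatched: copy the state of a single segment only
      term = top.term();
      docFreq = top.docFreq();
    }
  }

  /** Moves to the next distinct term and sums its frequency over all segments. */
  boolean next() {
    SegmentCursor top = queue.peek();
    if (top == null) { term = null; return false; }
    term = top.term();
    docFreq = 0;
    while (top != null && term.equals(top.term())) {
      queue.poll();                      // remove the cursor we just peeked at
      docFreq += top.docFreq();
      if (top.next()) queue.add(top);    // re-queue it if the segment has more terms
      top = queue.peek();
    }
    return true;
  }

  /** Reads the initial term without calling next(), then iterates to the end. */
  static List<String> walk(boolean advanceInConstructor) {
    List<SegmentCursor> segments = Arrays.asList(
        new SegmentCursor(new String[] {"aaa"}, new int[] {100}),
        new SegmentCursor(new String[] {"aaa", "bbb"}, new int[] {100, 100}));
    MergedTermEnumSketch e = new MergedTermEnumSketch(segments, advanceInConstructor);
    List<String> seen = new ArrayList<String>();
    seen.add(e.term + " " + e.docFreq);
    while (e.next()) seen.add(e.term + " " + e.docFreq);
    return seen;
  }

  public static void main(String[] args) {
    System.out.println("without next() in constructor: " + walk(false));
    System.out.println("with next() in constructor:    " + walk(true));
  }
}
```

With these two toy segments, the unpatched variant yields [aaa 100, aaa 200, bbb 100], that is, the start term appears twice and its first frequency covers only one segment. The variant that calls next() in the constructor yields [aaa 200, bbb 100].
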
> 
> kind regards,
> Christoph
> 
> -- 
> *****************************************************************
> * Dr. Christoph Goller       Tel.:   +49 89 203 45734           *
> * Detego Software GmbH       Mobile: +49 179 1128469            *
> * Keuslinstr. 13             Fax.:   +49 721 151516176          *
> * 80798 München, Germany     Email:  [EMAIL PROTECTED]  *
> *****************************************************************
> import java.io.IOException;
> 
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermEnum;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> 
> /*
>  * Created on 23.04.2003
>  *
>  * To change the template for this generated file go to
>  * Window>Preferences>Java>Code Generation>Code and Comments
>  */
> 
> /**
>  * @author goller
>  *
>  * To change the template for this generated type comment go to
>  * Window>Preferences>Java>Code Generation>Code and Comments
>  */
> public class SegmentsTermEnumTest {
> 
>   int docCount = 0;
>   
>   void addDoc1(IndexWriter writer)
>   {
>     Document doc = new Document();
>     
>     doc.add(Field.Keyword("id","id" + docCount));
>     doc.add(Field.UnStored("content","aaa"));
>     
>     try {
>       writer.addDocument(doc);
>     }
>     catch (IOException e) {
>       // TODO Auto-generated catch block
>       e.printStackTrace();
>     }
>     docCount++;
>   }
>   
>   void addDoc2(IndexWriter writer)
>   {
>     Document doc = new Document();
>     
>     doc.add(Field.Keyword("id","id" + docCount));
>     doc.add(Field.UnStored("content","aaa bbb"));
>     
>     try {
>       writer.addDocument(doc);
>     }
>     catch (IOException e) {
>       // TODO Auto-generated catch block
>       e.printStackTrace();
>     }
>     docCount++;
>   }
>   
>   
>   
>   public static void main(String[] args) 
>   {
>     //System.out.println(System.getProperty("java.version"));
>     
>     Directory dir = new RAMDirectory();
>     SegmentsTermEnumTest test = new SegmentsTermEnumTest();
>     
>     IndexWriter writer = null;
>     IndexReader reader = null;
>     TermEnum enum = null;
>     int i;
>     
>     try {
>       writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
>       
>       for (i = 0; i < 100; i++)
>         test.addDoc1(writer);
>       
>       for (i = 0; i < 100; i++)
>         test.addDoc2(writer);
>       
>       writer.close();
>     }
>     catch (IOException e) {
>       // TODO Auto-generated catch block
>       e.printStackTrace();
>     }
>     
>     
>     try {
>       reader = IndexReader.open(dir);
>   
>       System.out.println("terms():");
>       enum = reader.terms();
>       for(i = 0; i < 5 && enum.next(); i++)
>         System.out.println(enum.term() + " " + enum.docFreq());
>       
>       enum.close();
>       
>       System.out.println();
>       System.out.println("terms(\"aaa\")");
>       enum = reader.terms(new Term("content", "aaa"));
>       System.out.println(enum.term() + " " + enum.docFreq());
>       for(i = 0; i < 5 && enum.next(); i++)
>         System.out.println(enum.term() + " " + enum.docFreq());
>         
>       enum.close();
>       reader.close();
>       
>       writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
>       writer.optimize();
>       writer.close();
>    
>       System.out.println();
>       System.out.println("optimize");
>       
>       reader = IndexReader.open(dir);
>   
>       System.out.println();
>       System.out.println("terms():");
>       enum = reader.terms();
>       for(i = 0; i < 5 && enum.next(); i++)
>         System.out.println(enum.term() + " " + enum.docFreq());
>       
>       enum.close();
>       
>       System.out.println();
>       System.out.println("terms(\"aaa\")");
>       enum = reader.terms(new Term("content", "aaa"));
>       System.out.println(enum.term() + " " + enum.docFreq());
>       for(i = 0; i < 5 && enum.next(); i++)
>         System.out.println(enum.term() + " " + enum.docFreq());
>         
>       enum.close();
>       reader.close();
>       
>     }
>     catch (IOException e2) {
>       // TODO Auto-generated catch block
>       e2.printStackTrace();
>     }
>     
> 
>     
>     
>     
>     
>   }
> }
> Index: SegmentsReader.java
> ===================================================================
> RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentsReader.java,v
> retrieving revision 1.11
> diff -u -r1.11 SegmentsReader.java
> --- SegmentsReader.java       1 May 2003 01:09:15 -0000       1.11
> +++ SegmentsReader.java       3 Sep 2003 13:03:27 -0000
> @@ -238,9 +238,7 @@
>      }
>  
>      if (t != null && queue.size() > 0) {
> -      SegmentMergeInfo top = (SegmentMergeInfo)queue.top();
> -      term = top.termEnum.term();
> -      docFreq = top.termEnum.docFreq();
> +      next();
>      }
>    }
>  
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

