Proposition :adding minMergeDoc to IndexWriter

Julien Nioche Tue, 23 Sep 2003 02:34:30 -0700

Hui,

Concerning an other point of your request list I proposed a patch this week
end on the lucene-dev list and i totally forgot that this feature was
requested on the user list.


This new feature should help you to set a number of Documents to be merged
in memory independently of the mergeFactor.

Any comments would be appreciated

Best regards

Julien Nioche
http://www.lingway.com

---------- Debut du message initial -----------

De     : "fp235-5" <[EMAIL PROTECTED]>
A      : "lucene-dev" <[EMAIL PROTECTED]>
Copies :
Date   : Sat, 20 Sep 2003 16:06:06 +0200
Sujet  : [PATCH] IndexWriter : controling the number of Docs merged

Hello,

Someone made a suggestion yesterday about adding a variable to IndexWriter
in
order to control the number of Documents merged in RAMDirectory
independently of
the mergeFactor. (I'm sorry I don't remember who exactly and the mail
arrived at
my office).
I'm proposing a tiny modification of IndexWriter to add this functionality.
A
variable minMergeDocs specifies the number of Documents to be merged in
memory
before starting a new Segment. The mergeFactor still control the number of
Segments created in the Directory and thus it's possible to avoid the file
number limitation problem.

The diff file is attached.

As noticed by Dmitry and Erik there are no true JUnit tests. I'd be OK to
write
a JUnit test for this feature. The problem is that the SegmentInfos field is
private in IndexWriter and can't be used to check the number and size of the
Segments. I ran a test using the infoStream variable of IndexWriter -
everything
seems to be OK.

Any comments / suggestions are welcome.

Regards

Julien









----- Original Message -----
From: "hui" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, September 22, 2003 3:40 PM
Subject: Re: per-field Analyzer (was Re: some requests)


> Good work, Erik.
>
> Hui
>
> ----- Original Message -----
> From: "Erik Hatcher" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Saturday, September 20, 2003 4:13 AM
> Subject: per-field Analyzer (was Re: some requests)
>
>
> > On Friday, September 19, 2003, at 07:45  PM, Erik Hatcher wrote:
> > > On Friday, September 19, 2003, at 11:15  AM, hui wrote:
> > >> 1. Move the Analyzer down to field level from document level so some
> > >> fields
> > >> could be applied a specail analyzer.Other fields still use the
default
> > >> analyzer from the document level.
> > >> For example, I do not need to index the number for the "content"
> > >> field. It
> > >> helps me reduce the index size a lot when I have some excel files.
> > >> But I
> > >> always need the "created_date" to be indexed though it is a number
> > >> field.
> > >>
> > >> I know there are some workarounds put in the group, but I think it
> > >> should be
> > >> a good feature to have.
> > >
> > > The "workaround" is to write a custom analyzer and and have it do the
> > > desired thing per-field.
> > >
> > > Hmmm.... just thinking out loud here without knowing if this is
> > > possible, but could a generic "wrapper" Analyzer be written that
> > > allows other analyzers to be used under the covers based on a field
> > > name/analyzer mapping?   If so, that would be quite cool and save
> > > folks from having to write custom analyzers as much to handle this
> > > pretty typical use-case.  I'll look into this more in the very near
> > > future personally, but feel free to have a look at this yourself and
> > > see what you can come up with.
> >
> > What about something like this?
> >
> > public class PerFieldWrapperAnalyzer extends Analyzer {
> >    private Analyzer defaultAnalyzer;
> >    private Map analyzerMap = new HashMap();
> >
> >
> >    public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
> >      this.defaultAnalyzer = defaultAnalyzer;
> >    }
> >
> >    public void addAnalyzer(String fieldName, Analyzer analyzer) {
> >      analyzerMap.put(fieldName, analyzer);
> >    }
> >
> >    public TokenStream tokenStream(String fieldName, Reader reader) {
> >      Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
> >      if (analyzer == null) {
> >        analyzer = defaultAnalyzer;
> >      }
> >
> >      return analyzer.tokenStream(fieldName, reader);
> >    }
> > }
> >
> > This would allow you to construct a single analyzer out of others, on a
> > per-field basis, including a default one for any fields that do not
> > have a special one.  Whether the constructor should take the map or the
> > addAnalyzer method is implemented is debatable, but I prefer the
> > addAnalyzer way.  Maybe addAnalyzer could return 'this' so you could
> > chain: new PerFieldWrapperAnalyzer(new
> > StandardAnalyzer).addAnalyzer("field1", new
> > WhitespaceAnalyzer()).addAnalyzer(.....).  And I'm more inclined to
> > call this thing PerFieldAnalyzerWrapper instead.  Any naming
> > suggestions?
> >
> > This simple little class would seem to be the answer to a very common
> > question asked.
> >
> > Thoughts?  Should this be made part of the core?
> >
> > Erik
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Index: IndexWriter.java
===================================================================
RCS file: 
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
retrieving revision 1.15
diff -u -r1.15 IndexWriter.java
--- IndexWriter.java    15 Sep 2003 12:40:23 -0000      1.15
+++ IndexWriter.java    20 Sep 2003 12:22:13 -0000
@@ -249,6 +249,16 @@
    *
    * <p>This must never be less than 2.  The default value is 10.*/
   public int mergeFactor = 10;
+  
+  /** Determines the minimal number of documents required before merging
+   * and starting a new Segment. Since Documents are merged in a 
+   * [EMAIL PROTECTED] org.apache.lucene.store.RAMDirectory}, large value gives 
faster 
+   * indexing. At the same time mergeFactor limits the number of files open in 
+   * a FSDirectory.
+   * 
+   * <p> The default value is 10.*/
+  public int minMergeDocs = 10;
+  
 
   /** Determines the largest number of documents ever merged by addDocument().
    * Small values (e.g., less than 10,000) are best for interactive indexing,
@@ -316,7 +326,7 @@
 
   /** Incremental segment merger.  */
   private final void maybeMergeSegments() throws IOException {
-    long targetMergeDocs = mergeFactor;
+    long targetMergeDocs = minMergeDocs;
     while (targetMergeDocs <= maxMergeDocs) {
       // find segments smaller than current target size
       int minSegment = segmentInfos.size();

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Proposition :adding minMergeDoc to IndexWriter

Reply via email to