Hi,

The reason for this is multithreaded merging. While indexing, Lucene merges 
segments in separate threads. Because this runs multithreaded, there is no 
strict "order of things". Depending on how fast the disk is or what other 
processes are running in parallel, merging may proceed faster or slower, so 
each run can end up with a different "index structure" in which segments are 
merged in different combinations, leading to different term dictionary and 
posting list sizes.

If you do a forceMerge(1) at the end (this can take a very long time), the 
whole index is merged into one segment, which should have the same size for 
the same dataset. Please don't compare MD5/SHA1 checksums of the files: the 
files will *not* be identical, because the order of documents may still vary.
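
If you want to compare two runs by size rather than by checksum, something like the following might do. It is only a minimal sketch using plain java.nio, not a Lucene API: the class name IndexSize and the method totalBytes are made up for illustration, and it assumes each index lives in a flat directory that is no longer being written to (that is, after forceMerge(1) has finished and the IndexWriter is closed).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene: compares two index directories
// by total size on disk instead of by per-file checksums.
public class IndexSize {

    // Sums the sizes of all regular files directly inside indexDir.
    // Assumption: the index directory is flat (no subdirectories).
    static long totalBytes(Path indexDir) throws IOException {
        long total = 0;
        try (Stream<Path> files = Files.list(indexDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(p)) {
                    total += Files.size(p);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: IndexSize <indexDirA> <indexDirB>");
            return;
        }
        long a = totalBytes(Paths.get(args[0]));
        long b = totalBytes(Paths.get(args[1]));
        System.out.println(args[0] + ": " + a + " bytes");
        System.out.println(args[1] + ": " + b + " bytes");
        System.out.println("difference: " + Math.abs(a - b) + " bytes");
    }
}
```

Run it with the two index directories as arguments. Without the forceMerge(1) step, a difference between runs is expected for the reasons given above.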

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Jose Carlos Canova [mailto:jose.carlos.can...@gmail.com]
> Sent: Tuesday, March 25, 2014 6:36 AM
> To: java-user@lucene.apache.org
> Subject: Index size for Same DataSet.
> 
> Hello,
> 
> I have a question about index size.
> I am testing a program that uses Lucene to index a dataset.
> 
> The final index size varies a little between runs. Since I haven't 
> finished the tests yet, I am wondering whether it is normal for the index 
> size to vary among different tests.
> 
> Regards.


