"maureen tanuwidjaja" <[EMAIL PROTECTED]> wrote: > "One thing that stands out in your listing is: your norms file > (_1ke1.nrm) is enormous compared to all other files. Are you indexing > many tiny docs where each docs has highly variable fields or > something?" > > Ya I also confuse why this nrm file is trmendous in size. > I am indexing a total of 657739 XML document . > Total number of fields are 37552 fields (I am using XML tags as the > field)
OK, this is going to be a problem for Lucene: this case will definitely
go over 2X disk usage during optimize. I will update the javadocs to
call out this caveat.

The .nrm file (norms) requires 1 byte per document per unique field in
the segment, regardless of whether that document actually has that
field (i.e., it is not a "sparse" representation). When you have many
small docs, and each doc has (somewhat) different fields from the
others, this results in tremendously large storage for the norms.

The thing is, within any one segment it may be OK, since that segment
holds only a subset of all docs and fields. But when segments are
merged (as optimize does), the product of #docs and #fields grows
multiplicatively, and the merged segment requires far, far more storage
than the sum of the individual segments; see the arithmetic below.

The only simple workaround I can think of is to set maxMergeDocs to
keep all segments small. But then you may accumulate too many segments
over time. Either that, or find a way to reduce the number of unique
fields you actually need to store.
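To make the multiplicative blowup concrete, here is the
back-of-the-envelope arithmetic for your numbers (assuming optimize
collapses everything into one segment containing all docs and all
unique fields):

  657,739 docs x 37,552 unique fields x 1 byte/doc/field
    = 24,699,414,928 bytes
    ~ 23 GB of norms in the single optimized segment

Before the merge, each segment only pays for the fields its own docs
happen to use, so the sum of the per-segment norms files can be far
smaller than this worst case.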
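For the maxMergeDocs workaround, a minimal sketch (assuming the
IndexWriter(String, Analyzer, boolean) constructor and the
setMaxMergeDocs setter from the 2.x API; the index path, analyzer
choice, and the 10,000 cap are illustrative placeholders, not
recommendations):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class SmallSegmentIndexer {
      public static void main(String[] args) throws Exception {
          // Path and analyzer are placeholders for your own setup.
          IndexWriter writer =
              new IndexWriter("index", new StandardAnalyzer(), true);

          // Cap segment size so merges never build one huge segment
          // whose norms cost (#docs in segment) x (#unique fields in
          // segment) bytes. Tune the cap to your doc/field counts.
          writer.setMaxMergeDocs(10000);

          // ... writer.addDocument(...) calls go here ...

          // Note: calling writer.optimize() would defeat this, since
          // collapsing everything into a single segment recreates the
          // multiplicative norms blowup.
          writer.close();
      }
  }

Mike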