"maureen tanuwidjaja" <[EMAIL PROTECTED]> wrote: > "One thing that stands out in your listing is: your norms file > (_1ke1.nrm) is enormous compared to all other files. Are you indexing > many tiny docs where each docs has highly variable fields or > something?" > > Ya I also confuse why this nrm file is trmendous in size. > I am indexing a total of 657739 XML document . > Total number of fields are 37552 fields (I am using XML tags as the > field)
OK, this is going to be a problem for Lucene: this case will definitely
go over 2X disk usage during optimize. I will update the javadocs to
call out this caveat.

The .nrm file (norms) requires 1 byte per document per unique field in
the segment, regardless of whether that document actually has that
field (i.e., it is not a "sparse" representation). When you have many
small docs, and each doc has (somewhat) different fields from the
others, this results in tremendously large storage for the norms.

The thing is, within any one segment it may be OK, since that segment
holds only a subset of all docs and fields. But when segments are
merged (as optimize does), the product of #docs and #fields grows
multiplicatively, and the merged segment requires far, far more storage
than the sum of the individual segments; see the arithmetic below.

The only simple workaround I can think of is to set maxMergeDocs to
keep all segments small. But then you may accumulate too many segments
over time. Either that, or find a way to reduce the number of unique
fields you actually need to store.
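To make the multiplicative blowup concrete, here is the
back-of-the-envelope arithmetic for your numbers (assuming optimize
collapses everything into one segment containing all docs and all
unique fields):

  657,739 docs x 37,552 unique fields x 1 byte/doc/field
    = 24,699,414,928 bytes
    ~ 23 GB of norms in the single optimized segment

Before the merge, each segment only pays for the fields its own docs
happen to use, so the sum of the per-segment norms files can be far
smaller than this worst case.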
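For the maxMergeDocs workaround, a minimal sketch (assuming the
IndexWriter(String, Analyzer, boolean) constructor and the
setMaxMergeDocs setter from the 2.x API; the index path, analyzer
choice, and the 10,000 cap are illustrative placeholders, not
recommendations):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class SmallSegmentIndexer {
      public static void main(String[] args) throws Exception {
          // Path and analyzer are placeholders for your own setup.
          IndexWriter writer =
              new IndexWriter("index", new StandardAnalyzer(), true);

          // Cap segment size so merges never build one huge segment
          // whose norms cost (#docs in segment) x (#unique fields in
          // segment) bytes. Tune the cap to your doc/field counts.
          writer.setMaxMergeDocs(10000);

          // ... writer.addDocument(...) calls go here ...

          // Note: calling writer.optimize() would defeat this, since
          // collapsing everything into a single segment recreates the
          // multiplicative norms blowup.
          writer.close();
      }
  }

Mike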