Hi,

Generally, yes: disabling norms should reduce the index size. BUT you need to make the measurement consistent. If you disable norms, the segments have different sizes and get merged in a different way. As you did not tell us whether you deleted or updated documents during indexing, it is completely undefined how the segments of the index were merged during their lifetime.

To make a correct comparison, force-merge ("optimize" in Solr speak) the index at the end, before committing, and only then take the size for comparison. Force-merging compacts all segments into one single segment that has only a single terms index and a single postings list for each term.
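
A minimal sketch of the "take the size" step, using only the JDK. It assumes the index lives in a flat filesystem directory (as with Lucene's FSDirectory) and that you have already called IndexWriter.forceMerge(1) followed by commit() and closed the writer. The class name IndexDirSize and its methods are made up for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not a Lucene class): sums the sizes of all regular
// files directly inside an index directory.
class IndexDirSize {

    // Returns the total size in bytes, or 0 if the path is not a directory.
    static long sizeInBytes(Path indexDir) {
        if (!Files.isDirectory(indexDir)) {
            return 0L;
        }
        // Assumption: the index directory is flat, so no recursion is needed.
        try (Stream<Path> files = Files.list(indexDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(IndexDirSize::fileSize)
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static long fileSize(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the index directory as the first argument; defaults to ".".
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.printf("%s: %d bytes%n", dir, sizeInBytes(dir));
    }
}
```

Run it once against the index built with norms and once against the index built without norms, both force-merged to a single segment, and compare the two numbers.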

Of course, you should never force-merge a production index unless it is read-only or rarely updated (otherwise it needs to be force-merged again and again after updates). In general, removing norms to save space makes little sense (it won't affect your index much); it is only an optimization to speed up queries where scoring is not needed. Doing it for index size does not help under normal circumstances, because the size variance caused by the multi-segment structure and the ongoing merges is much higher than the size of the additional norms docvalues field.
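
As a rough plausibility check, here is a small sketch that computes an upper bound on what norms could cost, assuming the "one byte per document per field" figure from the original question, 5M documents, and the string field counts from the reported tests. The one-byte figure is taken from the question, not from the Lucene 5.3.1 file format (the real encoding may well be smaller), and the class NormsEstimate is purely illustrative.

```java
// Hypothetical back-of-the-envelope calculation, not a Lucene API.
class NormsEstimate {

    // Upper bound in bytes, ASSUMING 1 byte per document per field with norms.
    static long maxNormsBytes(long docCount, int fieldsWithNorms) {
        return docCount * fieldsWithNorms;
    }

    // Converts bytes to decimal megabytes (1 MB = 1,000,000 bytes).
    static double toMB(long bytes) {
        return bytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        long docs = 5_000_000L;
        // String field counts reported for the four test data sets.
        int[] stringFields = {74, 113, 87, 1026};
        for (int fields : stringFields) {
            long bytes = maxNormsBytes(docs, fields);
            System.out.printf("%d fields: at most %.0f MB of norms%n",
                    fields, toMB(bytes));
        }
    }
}
```

For the second data set (113 string fields) this bound is 565 MB, while the reported sizes differ by 6478 MB in the opposite direction (31890 MB without norms versus 25412 MB with norms). Under these assumptions, a difference like that cannot come from norms and points to different segment and merge states.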

Uwe

On 31.12.2024 at 14:50, Balaram Sharma wrote:
Dear Developers,

I learned that *omitting norms during indexing for a field saves a byte per document* in Lucene. However, during my testing, I observed varying results in the overall size of the Lucene index (collection of documents) when disabling norms for string fields during indexing.

Here are the configuration details for reference:

  * *Lucene Version:* 5.3.1
  * *Java Version:* OpenJDK 17.0.8.1
  * *Indexer Configuration:*
      o |index.merge_factor|: 10
      o |index.partition_max_doc|: 5,000,000
      o |indexer.commit_interval_sec|: 60
      o |indexer.commit_max_doc|: 100,000
  * *Merge Policy:* LogByteSizeMergePolicy

*Test Results:*

Test data sets:

  * DATA - I: All documents contain same set of fields and their values
  * DATA - II: All documents contain same set of fields but having random values
  * DATA - III: Documents contain different sets of field-value pairs, subsets of all field-value pairs
  * DATA - IV: Documents contain different sets of field-value pairs, subsets of all field-value pairs

TEST DATA    #UNIQUE FIELDS    #STRING FIELDS     AVG SIZE OF INDEX   AVG SIZE OF INDEX   DIFFERENCE
             IN AN INDEX       (NORMS ENABLED     IN MB               IN MB
             (5M DOCUMENTS)    OR DISABLED)       [NORMS ENABLED]     [NORMS DISABLED]
----------   --------------    ---------------    -----------------   -----------------   ----------------
DATA - I     103               74                 1869                1876                No difference
DATA - II    128               113                25412               31890               Increased by 20%
DATA - III   184               87                 2295                2005                Reduced by 14%
DATA - IV    1091              1026               10512               5905                Reduced by 43%

Could you please provide insights or clarify whether this behavior aligns with the expected impact on index size? Additionally, could you explain why the size reduction appears to be unpredictable?

Thank you for your assistance!


With Regards,

Balaram Sharma

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de
