Hi,
generally, yes, the index size should go down when you disable norms.
BUT: you need to make the measurement consistent. The problem is that
with norms disabled the segments have different sizes and therefore get
merged in a different way. As you did not tell us whether you deleted or
updated documents during indexing, it is completely undefined how the
segments of the index were merged during their lifetime.
To make a correct comparison, make sure to force-merge ("optimize" in
Solr speak) the index at the end before committing, and only then take
the size for comparison. A force merge compacts all segments into one
single segment that has only a single terms index and a single postings
list for each term.
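
As a rough sketch of the measurement step: once the indexer has called
writer.forceMerge(1) followed by writer.commit() and closed the writer,
the index directory can be summed up as below. The class name IndexSize
and the command-line argument are only illustrative assumptions, and the
code relies on a Lucene index directory being flat (no subdirectories).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene: sums the file sizes of an index directory.
public class IndexSize {

    /** Returns the total size in bytes of all regular files directly inside indexDir. */
    public static long sizeInBytes(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.list(indexDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the indexer already ran writer.forceMerge(1), writer.commit()
        // and writer.close(), so only the single merged segment is left on disk.
        Path indexDir = Path.of(args[0]);
        System.out.printf("%s: %d bytes%n", indexDir, sizeInBytes(indexDir));
    }
}
```

Run it once against the index built with norms and once against the one
built without, and compare the two numbers.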
Of course, you should never ever force-merge a production index unless
it is read-only or rarely updated (and in that case it needs to be
force-merged again after every round of updates). In general, removing
norms does not make much sense for size (it won't affect your index
much); it is only an optimization for speeding up queries where scoring
is not needed. So doing it for index size does not help under normal
circumstances, because the size variance caused by the multi-segment
structure and ongoing merges is much higher than the size of the
additional norms docvalues field.
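
To put numbers on that, here is a back-of-the-envelope sketch. It
assumes, as stated in the question, that norms cost at most one byte per
document per field; the class NormsEstimate is purely illustrative. With
5,000,000 documents and the 113 string fields of the second data set,
that upper bound is 565,000,000 bytes, far less than the size difference
reported for that data set (which even goes in the opposite direction).

```java
// Hypothetical helper for a rough estimate; not part of Lucene.
public class NormsEstimate {

    /**
     * Upper bound in bytes for norms storage, ASSUMING one norm byte
     * per document for every field that has norms enabled.
     */
    public static long maxNormsBytes(long docCount, int fieldsWithNorms) {
        return docCount * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 5_000_000L;             // index.partition_max_doc from the question
        int stringFields = 113;             // string fields of the second data set
        long observedDeltaMb = 31890 - 25412; // reported sizes: norms disabled minus enabled

        // Decimal megabytes (10^6 bytes), only meant as an order of magnitude.
        double upperBoundMb = maxNormsBytes(docs, stringFields) / 1_000_000.0;
        System.out.printf("norms upper bound: %.0f MB%n", upperBoundMb);
        System.out.printf("observed difference: %d MB%n", observedDeltaMb);
    }
}
```

Whatever the exact on-disk encoding of norms is, a difference that large
has to come from the segment structure, not from the norms themselves.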
Uwe
On 31.12.2024 at 14:50, Balaram Sharma wrote:
Dear Developers,
I learned that *omitting norms during indexing for a field saves a
byte per document* in Lucene. However, during my testing, I observed
varying results in the overall size of the Lucene index (collection of
documents) when disabling norms for string fields during indexing.
Here are the configuration details for reference:
* *Lucene Version:* 5.3.1
* *Java Version:* OpenJDK 17.0.8.1
* *Indexer Configuration:*
    o |index.merge_factor|: 10
    o |index.partition_max_doc|: 5,000,000
    o |indexer.commit_interval_sec|: 60
    o |indexer.commit_max_doc|: 100,000
* *Merge Policy:* LogByteSizeMergePolicy
*Test Results:*

TEST DATA  | # UNIQUE FIELDS | # STRING FIELDS | AVG SIZE OF INDEX | AVG SIZE OF INDEX | DIFFERENCE
           | IN AN INDEX     | (NORMS ENABLED  | IN MB             | IN MB             |
           | (5M DOCUMENTS)  | OR DISABLED)    | [NORMS ENABLED]   | [NORMS DISABLED]  |
-----------+-----------------+-----------------+-------------------+-------------------+-----------------
DATA - I   | 103             | 74              | 1869              | 1876              | No difference
DATA - II  | 128             | 113             | 25412             | 31890             | Increased by 20%
DATA - III | 184             | 87              | 2295              | 2005              | Reduced by 14%
DATA - IV  | 1091            | 1026            | 10512             | 5905              | Reduced by 43%

DATA - I:   All documents contain the same set of fields and the same values.
DATA - II:  All documents contain the same set of fields but with random values.
DATA - III: Documents contain different sets of field-value pairs, subsets of
            all field-value pairs.
DATA - IV:  Documents contain different sets of field-value pairs, subsets of
            all field-value pairs.
Could you please provide insights or clarify whether this behavior
aligns with the expected impact on index size? Additionally, could you
explain why the size reduction appears to be unpredictable?
Thank you for your assistance!
With Regards,
Balaram Sharma
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de