Re: Inquiry regarding storage reduction by omitting NORMS during indexing

Uwe Schindler Fri, 03 Jan 2025 02:01:07 -0800

Hi,

generally, yes there should be a reduction in index size when you disable norms. BUT: You need to make the measurement consistent. The problem is that if you disable norms, the segments have different sizes and get merged in a different way. As you did not tell us if you have deleted or updated documents during indexing, it is completely undefined how the segments of the index are merged during their lifetime.

To make a correct comparison, make sure to force-merge ("optimize" in Solr speak) the index at the end, before committing. After that, take the size for comparison. By using force merge you make sure to compact all segments into one single segment having only a single terms index and a single postings list for each term.
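For the "take the size" step, here is a minimal sketch in plain Java (no Lucene dependency). It assumes both indexes were already force-merged with IndexWriter.forceMerge(1), committed, and the writer closed, so that only the merged segment's files should remain on disk. The class name, method names, and the two command-line paths are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IndexSizeCompare {

    /** Sums the sizes of all regular files below the given index directory. */
    public static long directorySizeBytes(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.walk(indexDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    /** Change of 'after' relative to 'before' in percent (negative means smaller). */
    public static double relativeChangePercent(long before, long after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: args[0] = index built with norms,
        // args[1] = index built without norms, both force-merged and committed.
        Path withNorms = Paths.get(args[0]);
        Path withoutNorms = Paths.get(args[1]);

        long enabled = directorySizeBytes(withNorms);
        long disabled = directorySizeBytes(withoutNorms);

        System.out.printf("norms enabled:  %d bytes%n", enabled);
        System.out.printf("norms disabled: %d bytes%n", disabled);
        System.out.printf("change: %.1f%%%n", relativeChangePercent(enabled, disabled));
    }
}
```

Run it against the two index directories after both test runs have finished. Comparing sizes taken while merges are still running, or before the final commit, brings back exactly the variance described above.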

Of course, you should never ever force-merge a production index unless it is read-only or rarely updated (and even then it needs to be force-merged again and again after updates). In general, removing norms does not make much sense for size (it won't affect your index too much); it is only an optimization for speeding up queries where scoring is not needed. So doing that for index size does not help under normal circumstances, because the size variance due to the multi-segment structure and the merges going on is much higher than the additional norms docvalues field.


Uwe

On 31.12.2024 at 14:50, Balaram Sharma wrote:

Dear Developers,
I learned that *omitting norms during indexing for a field saves a byte per document* in Lucene. However, during my testing, I observed varying results in the overall size of the Lucene index (collection of documents) when disabling norms for string fields during indexing.
Here are the configuration details for reference:

  * *Lucene Version:* 5.3.1
  * *Java Version:* OpenJDK 17.0.8.1
  * *Indexer Configuration:*
      o |index.merge_factor|: 10
      o |index.partition_max_doc|: 5,000,000
      o |indexer.commit_interval_sec|: 60
      o |indexer.commit_max_doc|: 100,000
  * *Merge Policy:* LogByteSizeMergePolicy

*Test Results:*

TEST DATA    #UNIQUE FIELDS    #STRING FIELDS      AVG SIZE IN MB     AVG SIZE IN MB      DIFFERENCE
             IN AN INDEX       (NORMS ENABLED      [NORMS ENABLED]    [NORMS DISABLED]
             (5M DOCUMENTS)    OR DISABLED)
DATA - I     103               74                  1869               1876                No difference
DATA - II    128               113                 25412              31890               Increased by 20%
DATA - III   184               87                  2295               2005                Reduced by 14%
DATA - IV    1091              1026                10512              5905                Reduced by 43%

  * *DATA - I:* All documents contain the same set of fields and their values
  * *DATA - II:* All documents contain the same set of fields but with random values
  * *DATA - III:* Documents contain different sets of field-value pairs, subsets of all field-value pairs
  * *DATA - IV:* Documents contain different sets of field-value pairs, subsets of all field-value pairs

Could you please provide insights or clarify whether this behavior aligns with the expected impact on index size? Additionally, could you explain why the size reduction appears to be unpredictable?
Thank you for your assistance!


With Regards,

Balaram Sharma

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de
