Yonik Seeley wrote on 4.9.2017 at 18.03:
It's due to this (see comments in UnInvertedField):
*   To further save memory, the terms (the actual string values) are not all stored in
*   memory, but a TermIndex is used to convert term numbers to term values only
*   for the terms needed after faceting has completed.  Only every 128th term value
*   is stored, along with its corresponding term number, and this is used as an
*   index to find the closest term and iterate until the desired number is hit
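To make the cost described in that comment concrete, here's a minimal sketch of such a sampled term index. The class name SparseTermIndex and the in-memory backing list are my own illustration, not the actual UnInvertedField code; the point is that resolving an arbitrary ord means jumping to the nearest stored entry and scanning forward up to 127 terms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an "every 128th term" index: only every 128th
// term value is kept in memory; resolving an arbitrary ord seeks to the
// closest stored entry and iterates forward until the desired ord is hit.
class SparseTermIndex {
    static final int INTERVAL = 128;
    private final List<String> allTerms;   // stands in for the on-disk term dictionary
    private final List<String> sampled = new ArrayList<>();

    SparseTermIndex(List<String> sortedTerms) {
        this.allTerms = sortedTerms;
        for (int ord = 0; ord < sortedTerms.size(); ord += INTERVAL) {
            sampled.add(sortedTerms.get(ord)); // keep only every 128th term
        }
    }

    // Resolve an ord to its term: jump to the nearest preceding sampled
    // entry, then scan forward (here: re-read the backing list) to the ord.
    String lookup(int ord) {
        int base = (ord / INTERVAL) * INTERVAL;
        String term = sampled.get(base / INTERVAL);
        for (int i = base; i < ord; i++) {
            term = allTerms.get(i + 1); // simulates a sequential terms scan
        }
        return term;
    }
}
```

In the worst case each lookup scans INTERVAL - 1 extra terms, which is why resolving many buckets after faceting can add up.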

There are probably a number of ways we could speed this up somewhat:
- optimize how much memory is used to store the term index and use the savings to store more than every 128th term
- store the terms contiguously in block(s)
- don't store the whole term, only store what's needed to seek to the Nth term correctly
- when retrieving many terms, sort them first and convert from ord->str in order

For what it's worth, I've now tested on our production servers, which can hold the full index in memory, and the results are in line with the previous ones (47 million records, 1785 buckets in the tested facet):

1.) index with docValues="true":

- unoptimized: ~6000ms if facet.method is not specified
- unoptimized: ~7000ms with facet.method=uif
- optimized: ~7800ms if facet.method is not specified
- optimized: ~7700ms with facet.method=uif

Note that the optimization took a while and other activity on the servers varies throughout the day, so the optimized and unoptimized numbers cannot be compared directly. It still bugs me a bit that the optimized index seems slightly slower here.

2.) index with docValues="false":

- unoptimized: ~2600ms if facet.method is not specified
- unoptimized: ~1200ms with facet.method=uif
- optimized: ~2600ms if facet.method is not specified
- optimized: ~17ms with facet.method=uif

--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
