[
https://issues.apache.org/jira/browse/LUCENE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593536#comment-16593536
]
Toke Eskildsen commented on LUCENE-8374:
----------------------------------------
My previous stats script was slightly off. I isolated the block and
DENSE structures and got
```
Index: 890GB / 51 fields / 307M docs
Blocks: Total 238971 (blocks: ALL=102693, SPARSE=28739, DENSE=102421,
EMPTY=5118) / 1866KB
DENSE-Rank: 28118KB (raw DENSE data: 102421*8KB = 819368KB)
vBPV(long): 7 / 654KB
Total: 30653KB, 13 seconds start up
```
DENSE is by far the biggest consumer of cache space in our setup. Interestingly
enough, the vBPV caching was the one that gave us by far the biggest benefit,
for the few long fields that we have.
I looked at skip lists, both in the
[MultiLevelSkipListWriter|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java]
and [on Wikipedia|https://en.wikipedia.org/wiki/Skip_list]. As I understand
it, they are essentially a {{rank+select}} implementation that allows for
varying-size skips and works well with a linked list that can be modified. The
varying-size skips and the mutability do not seem to be used or relevant for
Lucene.
What I don't really understand is the benefit of the skip list's multi-level
approach in this case. How would a skip list be better than the current
direct lookup in the array of longs representing offset+bitcount? If the point
is to save further memory, the block offsets could be stored for every 4th
block or so, just as a skip list does. But the current overhead of 2MB for a
rather large segment does not seem problematic to me, and it means that zero
superfluous blocks need to be visited.
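To make that concrete, a minimal sketch (class and field names are mine, not
from the patch) of what the direct lookup amounts to:
```
// Sketch, not patch code: one packed long per 65536-doc block, holding the
// block's file offset and the accumulated bitcount (the ordinal base).
final class BlockTable {
  private final long[] offsetAndBitcount; // one entry per block

  BlockTable(long[] offsetAndBitcount) {
    this.offsetAndBitcount = offsetAndBitcount;
  }

  // A single array access resolves any block, so zero superfluous blocks
  // are visited regardless of segment size.
  long lookup(int docID) {
    return offsetAndBitcount[docID >>> 16]; // block index = docID / 65536
  }
}
```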
New point: I would very much like to make this issue two-step.
As the performance regression gets worse linearly with segment size, it seems
plausible that the people who will benefit the most from the patch are also
the people for whom a full re-index is not trivial. From my point of view,
search-time caching should be available for existing segments, independently
of codec changes to future segments. The current patch needs polishing, but is
function-wise ready and does exactly this.
As the search-time cache is non-destructive, rolling it out as step 1 would be
a conservative update with easy rollback. Step 2 is of course to change the
codec to embed the caching structures, if they prove their worth.
> Reduce reads for sparse DocValues
> ---------------------------------
>
> Key: LUCENE-8374
> URL: https://issues.apache.org/jira/browse/LUCENE-8374
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 7.4, master (8.0)
> Reporter: Toke Eskildsen
> Priority: Major
> Labels: performance
> Fix For: 7.4, master (8.0), 7.3.1
>
> Attachments: LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch,
> LUCENE-8374_branch_7_3.patch, LUCENE-8374_branch_7_4.patch
>
>
> The {{Lucene70DocValuesProducer}} has the internal classes
> {{SparseNumericDocValues}} and {{BaseSortedSetDocValues}} (sparse code path),
> which in turn use {{IndexedDISI}} to handle the docID -> value-ordinal lookup.
> The value-ordinal is the index of the docID, assuming an abstract, tightly
> packed, monotonically increasing list of docIDs: if the docIDs with
> corresponding values are {{[0, 4, 1432]}}, their value-ordinals will be
> {{[0, 1, 2]}}.
> h2. Outer blocks
> The lookup structure of {{IndexedDISI}} consists of blocks of 2^16 values
> (65536), where each block can be either {{ALL}}, {{DENSE}} (2^12 to 2^16
> values) or {{SPARSE}} (< 2^12 values ~= 6%). Consequently blocks vary quite a
> lot in size and ordinal resolving strategy.
> When a sparse Numeric DocValue is needed, the code first locates the block
> containing the wanted docID. It does so by iterating blocks one-by-one
> until it reaches the needed one, where each iteration requires a lookup in
> the underlying {{IndexSlice}}. For a common memory-mapped index, this
> translates to either a cached request or a read operation. If a segment has
> 6M documents, the worst case is 91 lookups. In our web archive, our segments
> have ~300M values: a worst case of 4577 lookups!
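> To put numbers on the linear scan, a small sketch of the cost model (my
> arithmetic, not the actual {{IndexedDISI}} code):
> ```
> // One slice read per block skipped: reaching the block that contains
> // targetDocID from block 0 costs targetBlock header reads.
> static int headerReadsToReach(int targetDocID) {
>   return targetDocID >>> 16; // 65536 docs per block
> }
> // headerReadsToReach(300_000_000) == 4577, the worst case mentioned above
> ```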
> One obvious solution is to use a lookup-table for blocks: A long[]-array with
> an entry for each block. For 6M documents, that is < 1KB and would allow for
> direct jumping (a single lookup) in all instances. Unfortunately this
> lookup-table cannot be generated upfront when the writing of values is purely
> streaming. It can be appended to the end of the stream before it is closed,
> but without knowing the position of the lookup-table the reader cannot seek
> to it.
> One strategy for creating such a lookup-table would be to generate it during
> reads and cache it for the next lookup. This does not fit directly into how
> {{IndexedDISI}} currently works (it is created anew for each invocation), but
> could probably be added with a little work. An advantage of this approach is
> that it does not change the underlying format and thus could be used with
> existing indexes.
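> A minimal sketch of how such read-time caching could look (hypothetical
> class and names, not the actual patch):
> ```
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.function.Supplier;
>
> // Hypothetical: build the block jump-table on first traversal and share it
> // across the IndexedDISI instances created for the same segment and field.
> final class BlockJumpTableCache {
>   private final Map<String, long[]> cache = new ConcurrentHashMap<>();
>
>   long[] getOrBuild(String segmentAndField, Supplier<long[]> builder) {
>     // First call pays for the full block scan; later calls jump directly.
>     return cache.computeIfAbsent(segmentAndField, key -> builder.get());
>   }
> }
> ```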
> h2. The lookup structure inside each block
> If {{ALL}} of the 2^16 values are defined, the structure is empty and the
> ordinal is simply the requested docID with some modulo and multiply math.
> Nothing to improve there.
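> For illustration, my reading of that math (a sketch, not the actual code):
> ```
> // ALL block: every docID in the block has a value, so the ordinal is the
> // number of values in all previous blocks plus the position in this block.
> static long ordinalInAllBlock(long blockBaseOrdinal, int docID) {
>   return blockBaseOrdinal + (docID & 0xFFFF); // docID modulo 65536
> }
> ```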
> If the block is {{DENSE}} (2^12 to 2^16 values are defined), a bitmap is used
> and the number of set bits up to the wanted index (the docID modulo the block
> origin) is counted. That bitmap is a long[1024], meaning that the worst case
> is to look up and count all set bits in 1024 longs!
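> That worst case corresponds to a naive rank over the bitmap (sketch):
> ```
> // Count the set bits below indexInBlock in a DENSE block's long[1024]
> // bitmap. Worst case touches all 1024 longs for a single ordinal.
> static long naiveRank(long[] bitmap, int indexInBlock) {
>   long rank = 0;
>   for (int i = 0; i < indexInBlock >>> 6; i++) {
>     rank += Long.bitCount(bitmap[i]);          // whole longs below the target
>   }
>   long mask = (1L << (indexInBlock & 63)) - 1; // bits below the target position
>   rank += Long.bitCount(bitmap[indexInBlock >>> 6] & mask);
>   return rank;
> }
> ```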
> One known solution to this is to use a [rank
> structure|https://en.wikipedia.org/wiki/Succinct_data_structure]. I
> [implemented
> it|https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java]
> for a related project, and with that, the rank overhead for a {{DENSE}}
> block would be long[32] and would ensure a maximum of 9 lookups. It is not
> trivial to build the rank structure, and caching it (assuming all blocks are
> dense) for 6M documents would require 22KB (3.17% overhead). It would be far
> better to generate the rank structure at index time and store it immediately
> before the bitset (this is possible with streaming, as each block is fully
> resolved before flushing), but of course that would require a change to the
> codec.
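> A sketch of how the numbers above could fit together: 128 pre-computed
> 16-bit prefix counts, one per 8-long span, packed into long[32]; a lookup is
> then 1 rank read plus at most 8 bitmap longs = 9 lookups. This is my
> interpretation of the figures, not the actual RankCache layout:
> ```
> // Build the packed rank cache for one DENSE block (bitmap.length == 1024).
> static long[] buildRank(long[] bitmap) {
>   short[] prefix = new short[128];
>   int total = 0;
>   for (int span = 0; span < 128; span++) {
>     prefix[span] = (short) total;              // set bits before this span
>     for (int i = 0; i < 8; i++) {
>       total += Long.bitCount(bitmap[span * 8 + i]);
>     }
>   }
>   long[] packed = new long[32];                // 4 shorts per long
>   for (int i = 0; i < 128; i++) {
>     packed[i >>> 2] |= (prefix[i] & 0xFFFFL) << ((i & 3) << 4);
>   }
>   return packed;
> }
>
> // Resolve a rank with 1 cache read + at most 8 bitmap reads.
> static long rank(long[] bitmap, long[] packedRank, int indexInBlock) {
>   int span = indexInBlock >>> 9;               // 512 bits per span
>   long rank = (packedRank[span >>> 2] >>> ((span & 3) << 4)) & 0xFFFF;
>   for (int i = span << 3; i < indexInBlock >>> 6; i++) {
>     rank += Long.bitCount(bitmap[i]);          // at most 7 whole longs
>   }
>   long mask = (1L << (indexInBlock & 63)) - 1;
>   rank += Long.bitCount(bitmap[indexInBlock >>> 6] & mask);
>   return rank;
> }
> ```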
> If the block is {{SPARSE}} (< 2^12 values ~= 6%), the docIDs are simply
> stored as a list. As a comment in the code suggests, a binary search through
> these would be faster, although that would mean seeking backwards. If that is
> not acceptable, I don't have any immediate idea for avoiding the full
> iteration.
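> For reference, a sketch of the binary-search alternative (assuming the
> block's docID list has been read into memory or allows random access):
> ```
> // Binary search through the sorted docIDs of a SPARSE block. This requires
> // random access to the list, i.e. backwards seeks must be acceptable.
> static int ordinalInBlock(int[] sparseDocIDs, int targetDocID) {
>   int index = java.util.Arrays.binarySearch(sparseDocIDs, targetDocID);
>   return index; // >= 0: ordinal within the block; < 0: docID has no value
> }
> ```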
> I propose implementing query-time caching of both block-jumps and inner-block
> lookups for {{DENSE}} (using rank) as first improvement and an index-time
> {{DENSE}}-rank structure for future improvement. As query-time caching is
> likely to be too costly for rapidly-changing indexes, it should probably be
> an opt-in in solrconfig.xml.
> h2. Some real-world observations
> This analysis was triggered by massive (10x) slowdown problems with both
> simple querying and large exports from our web archive index after upgrading
> from Solr 4.10 to 7.3.1. The query matching itself takes ½-2 seconds, but
> returning the top-10 documents takes 5-20 seconds (~50 non-stored DocValues
> fields), compared to ½-2 seconds in total with Solr 4.10 (which used more of
> a mix of stored vs. DocValues fields, so the numbers might not be directly
> comparable).
> Measuring with VisualVM points to {{NIOFSIndexInput.readInternal}} as *the*
> hotspot. We ran some tests with simple queries on a single 307,171,504-document
> segment with different single-value DocValues fields in {{fl}} and got
>
> ||Field||Type||Docs with value||Docs w/ val %||Speed in docs/sec||
> |url|String|307,171,504|100%|12,500|
> |content_type_ext|String|224,375,378|73%|360|
> |author|String|1,506,365|0.5%|1,100|
> |crawl_date|DatePoint|307,171,498|~100%|90|
> |content_text_length|IntPoint|285,800,212|93%|410|
> |content_length|IntPoint|307,016,816|99.9%|100|
> |crawl_year|IntPoint|307,171,498|~100%|14,500|
> |last_modified|DatePoint|6,835,065|2.2%|570|
> |source_file_offset|LongPoint|307,171,504|100%|28,000|
> Note how both url and source_file_offset are very fast and also have a value
> for _all_ documents. In contrast, content_type_ext is very slow and
> crawl_date is extremely slow; as they both have values for _nearly_ all
> documents, I presume they are using {{IndexedDISI#DENSE}}. last_modified is
> also quite slow and presumably uses {{IndexedDISI#SPARSE}}.
> The only mystery is crawl_year, which is also present in _nearly_ all
> documents but is very fast. I have no explanation for that one yet.
> I hope to take a stab at this around August 2018, but no promises.