[ https://issues.apache.org/jira/browse/LUCENE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693349#comment-16693349 ]

Toke Eskildsen commented on LUCENE-8374:
----------------------------------------

{quote}this patch spends significant CPU and I/O when opening the reader
{quote}
This patch spends the I/O equivalent of a single worst-case search matching 
1/16384th of the documents in the newly opened segment, sorting on a numeric 
DocValues field, once for each DocValues field in use. I am hard pressed to 
imagine a setup where that I/O penalty would not pay off very quickly.

As for CPU, the patch does perform a compression of the DENSE data structure 
to lower the memory overhead. In Tim's setup I could not get timings for his 
smaller segments, as they took less than 1 ms each. For his largest segment 
(19M documents) it took a total of 23 ms across 41 DocValues fields. For Tim, 
the startup overhead (a total of 412 ms across all segments & fields) was paid 
off in less than a second of operation.

Based on this, I disagree with your assertion of "significant" performance 
overhead at search time. It might turn out to be significant when applied 
broadly or tested in depth, but at this point neither theory nor observation 
suggests so.
{quote}For instance it will be easier to track which change a slowdown/speedup 
relates to in Mike's nightly benchmarks.
{quote}
As [discussed|https://github.com/mikemccand/luceneutil/issues/23] on Mike's 
GitHub space, his nightly benchmarks do not really track this part of Lucene 
(large-jump access of DocValues). If they did, an alarm would have been raised 
on the switch to the iterative API.

It is quite understandable that such tests are missing, as large indexes 
require a lot of computing resources and/or time. How to quickly catch such 
regressions in the future, without spending a lump sum on hardware, is a 
challenge I have no ready answer for.

In its current form, Mike's benchmarks seem fine for tracking LUCENE-8374 
regressions in the areas where there is no expectation of a speed-up, which is 
an important part of testing the value of an attempted optimization.
{quote}I'll take care of reviews.
{quote}
Thank you, I appreciate that.

As for the wait-for-7.6, I just based it on the note in 
[https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/ajax/%3CCABM%2Bu9JU%3DQFshN_juicOohU8dZQBtZ0Nc7Zs79_Wt9wZRjTVqA%40mail.gmail.com%3E] 
about big changes. I am happy to classify this patch as a non-big change.

> Reduce reads for sparse DocValues
> ---------------------------------
>
>                 Key: LUCENE-8374
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8374
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 7.5, master (8.0)
>            Reporter: Toke Eskildsen
>            Priority: Major
>              Labels: performance
>             Fix For: 7.6
>
>         Attachments: LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, 
> LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, 
> LUCENE-8374_branch_7_3.patch, LUCENE-8374_branch_7_3.patch.20181005, 
> LUCENE-8374_branch_7_4.patch, LUCENE-8374_branch_7_5.patch, 
> entire_index_logs.txt, image-2018-10-24-07-30-06-663.png, 
> image-2018-10-24-07-30-56-962.png, single_vehicle_logs.txt, 
> start-2018-10-24-1_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png,
>  
> start-2018-10-24_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png
>
>
> The {{Lucene70DocValuesProducer}} has the internal classes 
> {{SparseNumericDocValues}} and {{BaseSortedSetDocValues}} (sparse code path), 
> which in turn use {{IndexedDISI}} to handle the docID -> value-ordinal 
> lookup. The value-ordinal is the index of the docID, assuming an abstract, 
> tightly packed, monotonically increasing list of docIDs: if the docIDs with 
> corresponding values are {{[0, 4, 1432]}}, their value-ordinals will be 
> {{[0, 1, 2]}}.
> h2. Outer blocks
> The lookup structure of {{IndexedDISI}} consists of blocks of 2^16 values 
> (65536), where each block can be either {{ALL}}, {{DENSE}} (2^12 to 2^16 
> values) or {{SPARSE}} (< 2^12 values ~= 6%). Consequently, blocks vary quite 
> a lot in both size and ordinal-resolving strategy.
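> As a quick illustration (a sketch, not the actual {{IndexedDISI}} code), the 
> block encoding follows from the number of docIDs with values in the block:
> {code:java}
> // Illustrative only: block encoding chosen from the number of
> // docIDs with values among the 2^16 documents in a block.
> static String blockType(int docsWithValue) {
>   if (docsWithValue == 65536) {
>     return "ALL";    // no lookup structure needed
>   } else if (docsWithValue >= 4096) { // 2^12
>     return "DENSE";  // long[1024] bitmap
>   } else {
>     return "SPARSE"; // plain list of docIDs
>   }
> }
> {code}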
> When a sparse Numeric DocValue is needed, the code first locates the block 
> containing the wanted docID. It does so by iterating blocks one by one until 
> it reaches the needed one, where each iteration requires a lookup in the 
> underlying {{IndexSlice}}. For a common memory-mapped index, this translates 
> to either a cached request or a read operation. If a segment has 6M 
> documents, the worst case is 91 lookups. In our web archive, our segments 
> have ~300M values: a worst case of 4577 lookups!
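> The worst-case numbers follow directly from the block size (sketch, using 
> Java integer math):
> {code:java}
> // Each block covers 2^16 documents, so a linear scan to the last
> // block costs one lookup per skipped block:
> int lookupsFor6M   = 6_000_000 >>> 16;    // = 91 lookups
> int lookupsFor300M = 300_000_000 >>> 16;  // = 4577 lookups
> {code}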
> One obvious solution is to use a lookup table for blocks: a long[]-array 
> with an entry for each block. For 6M documents, that is < 1KB and would 
> allow direct jumping (a single lookup) in all instances. Unfortunately this 
> lookup table cannot be generated up front when the writing of values is 
> purely streaming. It can be appended to the end of the stream before it is 
> closed, but without knowing the position of the lookup table, the reader 
> cannot seek to it.
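> A rough sketch of the idea (the names here are illustrative, not codec 
> code):
> {code:java}
> // One long per 2^16-document block: ~92 entries (< 1 KB) for a
> // 6M-document segment.
> final class BlockJumpTable {
>   private final long[] blockOffsets; // blockOffsets[b] = slice position of block b
>
>   BlockJumpTable(long[] blockOffsets) {
>     this.blockOffsets = blockOffsets; // built during reads or stored by the codec
>   }
>
>   long offsetForDocID(int docID) {
>     return blockOffsets[docID >>> 16]; // direct jump instead of a linear scan
>   }
> }
> {code}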
> One strategy for creating such a lookup table would be to generate it during 
> reads and cache it for subsequent lookups. This does not fit directly into 
> how {{IndexedDISI}} currently works (it is created anew for each invocation), 
> but could probably be added with a little work. An advantage of this approach 
> is that it does not change the underlying format and thus could be used with 
> existing indexes.
> h2. The lookup structure inside each block
> If {{ALL}} of the 2^16 values are defined, the structure is empty and the 
> ordinal is simply the requested docID with some modulo-and-multiply math. 
> Nothing to improve there.
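> For illustration only ({{ordinalBase}} is an assumed name for the number of 
> values in all earlier blocks):
> {code:java}
> static long ordinalInAllBlock(long ordinalBase, int docID) {
>   // Every docID in an ALL block has a value, so the ordinal is the
>   // number of values in earlier blocks plus the position in this block.
>   return ordinalBase + (docID & 0xFFFF); // (docID & 0xFFFF) == docID % 65536
> }
> {code}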
> If the block is {{DENSE}} (2^12 to 2^16 values are defined), a bitmap is 
> used and the number of set bits up to the wanted index (the docID modulo the 
> block origin) is counted. That bitmap is a long[1024], meaning that the 
> worst case is to look up and count all set bits in 1024 longs!
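> A sketch of that worst case (a straightforward popcount over the bitmap; 
> illustrative, not the actual code):
> {code:java}
> // Count the set bits below indexInBlock (the docID modulo the block
> // origin) in the long[1024] bitmap. Worst case touches all 1024 longs.
> static long rank(long[] bitmap, int indexInBlock) {
>   long setBits = 0;
>   int wordIndex = indexInBlock >>> 6; // which long holds the bit
>   for (int w = 0; w < wordIndex; w++) {
>     setBits += Long.bitCount(bitmap[w]); // up to 1023 full words
>   }
>   // Partial word: only the bits below indexInBlock % 64 count.
>   setBits += Long.bitCount(bitmap[wordIndex] & ((1L << (indexInBlock & 63)) - 1));
>   return setBits;
> }
> {code}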
> One known solution to this is to use a 
> [rank structure|https://en.wikipedia.org/wiki/Succinct_data_structure]. I 
> [implemented it|https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java] 
> for a related project, and with that, the rank overhead for a {{DENSE}} 
> block would be long[32] and would ensure a maximum of 9 lookups. It is not 
> trivial to build the rank structure, and caching it (assuming all blocks are 
> dense) for 6M documents would require 22 KB (3.17% overhead). It would be 
> far better to generate the rank structure at index time and store it 
> immediately before the bitset (this is possible with streaming, as each 
> block is fully resolved before flushing), but of course that would require a 
> change to the codec.
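> A sketch of how a long[32] rank cache bounds the lookups. The layout here 
> (128 16-bit counters, one per 8 bitmap longs) is an assumption derived from 
> the stated long[32] size and 9-lookup bound, not necessarily the exact 
> {{RankCache}} layout:
> {code:java}
> // rankCache packs 128 16-bit counters into 32 longs; counter i holds
> // the number of set bits before bitmap long i*8.
> static long rankWithCache(long[] bitmap, long[] rankCache, int indexInBlock) {
>   int wordIndex = indexInBlock >>> 6; // target long in the bitmap
>   int chunk = wordIndex >>> 3;        // 8 longs (512 bits) per counter
>   // Lookup 1: unpack the 16-bit counter for this chunk.
>   long setBits = (rankCache[chunk >>> 2] >>> ((chunk & 3) << 4)) & 0xFFFF;
>   // Lookups 2-9: popcount at most 8 longs inside the chunk.
>   for (int w = chunk << 3; w < wordIndex; w++) {
>     setBits += Long.bitCount(bitmap[w]);
>   }
>   setBits += Long.bitCount(bitmap[wordIndex] & ((1L << (indexInBlock & 63)) - 1));
>   return setBits;
> }
> {code}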
> If the block is {{SPARSE}} (< 2^12 values ~= 6%), the docIDs are simply 
> stored as a list. As a comment in the code suggests, a binary search through 
> these would be faster, although that would mean seeking backwards. If that 
> is not acceptable, I have no immediate idea for avoiding the full iteration.
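> For reference, the binary search itself is simple if the docIDs are 
> available as a random-access list (a sketch; the on-disk form is a stream, 
> which is where the backwards seeking comes in):
> {code:java}
> // SPARSE: < 2^12 docIDs stored in increasing order. The ordinal of a
> // docID is its position in the list.
> static int sparseOrdinal(int[] docIDs, int count, int targetDocID) {
>   int lo = 0, hi = count - 1;
>   while (lo <= hi) {
>     int mid = (lo + hi) >>> 1;
>     if (docIDs[mid] < targetDocID) lo = mid + 1;
>     else if (docIDs[mid] > targetDocID) hi = mid - 1;
>     else return mid; // ordinal within the block
>   }
>   return -1; // targetDocID has no value
> }
> {code}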
> I propose implementing query-time caching of both block jumps and 
> inner-block lookups for {{DENSE}} (using rank) as a first improvement, and 
> an index-time {{DENSE}} rank structure as a future improvement. As 
> query-time caching is likely to be too costly for rapidly changing indexes, 
> it should probably be opt-in in solrconfig.xml.
> h2. Some real-world observations
> This analysis was triggered by massive (10x) slowdowns of both simple 
> querying and large exports from our web archive index after upgrading from 
> Solr 4.10 to 7.3.1. The query matching itself takes ½-2 seconds, but 
> returning the top 10 documents takes 5-20 seconds (~50 non-stored DocValues 
> fields), up from ½-2 seconds in total with Solr 4.10 (which had more of a 
> mix of stored vs. DocValues fields, so the numbers might not be directly 
> comparable).
> Measuring with VisualVM points to {{NIOFSIndexInput.readInternal}} as *the* 
> hotspot. We ran some tests with simple queries on a single 307,171,504 
> document segment, with a different single-value DocValues field in {{fl}} 
> for each test, and got:
>  
> ||Field||Type||Docs with value||Docs w/ val %||Speed in docs/sec||
> |url|String|307,171,504|100%|12,500|
> |content_type_ext|String|224,375,378|73%|360|
> |author|String|1,506,365|0.5%|1,100|
> |crawl_date|DatePoint|307,171,498|~100%|90|
> |content_text_length|IntPoint|285,800,212|93%|410|
> |content_length|IntPoint|307,016,816|99.9%|100|
> |crawl_year|IntPoint|307,171,498|~100%|14,500|
> |last_modified|DatePoint|6,835,065|2.2%|570|
> |source_file_offset|LongPoint|307,171,504|100%|28,000|
> Note how both url and source_file_offset are very fast and also have a 
> value for _all_ documents. Contrary to this, content_type_ext is very slow 
> and crawl_date is extremely slow; as both have values for _nearly_ all 
> documents, I presume they are using {{IndexedDISI#DENSE}}. last_modified is 
> also quite slow and presumably uses {{IndexedDISI#SPARSE}}. The only mystery 
> is crawl_year, which is also present in _nearly_ all documents but is very 
> fast. I have no explanation for that one yet.
> I hope to take a stab at this around August 2018, but no promises.


