[ https://issues.apache.org/jira/browse/LUCENE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708907#comment-16708907 ]
Toke Eskildsen commented on LUCENE-8374:
----------------------------------------

[~jpountz] thank you for the well-defined arguments. I see your point much more clearly now.

{quote}The fact that the default codec happily computes data at search-time rather than index-time may set a precedent...
{quote}
I think you place way too little faith in codec developers. Also, the codec is a scary beast and I seriously doubt that changes here would fly under the radar of you or the other developers. Concretely, the caches are injected in only a few key places, and one could well add explicit comments marking them as a band-aid, to ensure that nobody takes inspiration from them.

{quote}The introduced complexity of search-time-only support for jump tables makes it harder to evolve this doc-value format so that it computes jump tables at index time.
{quote}
I don't see how that follows. If anything, the patch points out the places that should use the index-time structures, which (unless LUCENE-8585 changes that) work the same way as the search-time ones. Where possible, the skipping code has been extracted to helper methods, making it even easier to build on.

{quote}We are reintroducing a feature that potentially allows the memory usage of readers to grow over time rather than being set in stone after the reader is open...
{quote}
True. I was not aware of the problems with this. One "solution" would be to cache all the DocValues structures up front. That would mean wasted time & space for unused fields, so I don't like it.

{quote}Search-time-only support introduces contention and/or complexity due to the need to access a cache concurrently.
{quote}
True (and I think I have already found a problem there), with the addendum that we will have the same problem if we decide that some structures should be kept in memory for LUCENE-8585. A sketch of the trade-off follows below.

From my point of view you are prioritizing architecture over functionality to a very high degree. That might well be more in line with the Lucene code base than my prioritization, and on that ground I concede that the search-time code should be removed (and thereby that the plug-in functionality for existing indexes goes away) once an index-time solution has been made. Removing a working band-aid solution before a working clean one has been committed is not something I support at this point.

On the more agreeable side, thanks to [~jim.ferenczi] for volunteering to do the index-time work. With a little luck we can wrap that up fairly quickly. I do have time before January, just not much. I guess we should talk details in the LUCENE-8585 issue?
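To make that trade-off concrete, here is a minimal sketch of the kind of lazily populated, shared cache under discussion. All names are hypothetical; this is an illustration, not the actual patch:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of a search-time jump-table cache shared between
// threads. Entries are built lazily on first access, which is exactly why
// reader memory can grow after the reader has been opened.
final class JumpTableCache {
  private final ConcurrentHashMap<String, long[]> perField = new ConcurrentHashMap<>();

  long[] getOrBuild(String field, Supplier<long[]> builder) {
    // computeIfAbsent makes reads cheap once an entry exists, but the first
    // thread to touch a field pays the build cost while concurrent callers
    // for that field block on its bucket: the contention trade-off.
    return perField.computeIfAbsent(field, f -> builder.get());
  }
}
{code}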
> Reduce reads for sparse DocValues
> ---------------------------------
>
> Key: LUCENE-8374
> URL: https://issues.apache.org/jira/browse/LUCENE-8374
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 7.5, master (8.0)
> Reporter: Toke Eskildsen
> Priority: Major
> Labels: performance
> Attachments: LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374_branch_7_3.patch, LUCENE-8374_branch_7_3.patch.20181005, LUCENE-8374_branch_7_4.patch, LUCENE-8374_branch_7_5.patch, LUCENE-8374_part_1.patch, LUCENE-8374_part_2.patch, LUCENE-8374_part_3.patch, LUCENE-8374_part_4.patch, entire_index_logs.txt, image-2018-10-24-07-30-06-663.png, image-2018-10-24-07-30-56-962.png, single_vehicle_logs.txt, start-2018-10-24-1_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png, start-2018-10-24_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png
>
> The {{Lucene70DocValuesProducer}} has the internal classes {{SparseNumericDocValues}} and {{BaseSortedSetDocValues}} (sparse code path), which in turn use {{IndexedDISI}} to handle the docID -> value-ordinal lookup. The value-ordinal is the index of the docID assuming an abstract, tightly packed, monotonically increasing list of docIDs: if the docIDs with corresponding values are {{[0, 4, 1432]}}, their value-ordinals will be {{[0, 1, 2]}}.
>
> h2. Outer blocks
>
> The lookup structure of {{IndexedDISI}} consists of blocks of 2^16 values (65536), where each block can be either {{ALL}}, {{DENSE}} (2^12 to 2^16 values) or {{SPARSE}} (< 2^12 values ~= 6%). Consequently blocks vary quite a lot in size and in ordinal-resolving strategy.
>
> When a sparse Numeric DocValue is needed, the code first locates the block containing the wanted docID. It does so by iterating blocks one by one until it reaches the needed one, where each iteration requires a lookup in the underlying {{IndexSlice}}. For a common memory-mapped index, this translates to either a cached request or a read operation. If a segment has 6M documents, the worst case is 91 lookups. In our web archive, our segments have ~300M values: a worst case of 4577 lookups!
>
> One obvious solution is to use a lookup table for blocks: a long[]-array with an entry for each block (see the sketch at the end of this section). For 6M documents, that is < 1KB and would allow direct jumping (a single lookup) in all instances. Unfortunately this lookup table cannot be generated up front when the writing of values is purely streaming. It can be appended to the end of the stream before it is closed, but without knowing the position of the lookup table, the reader cannot seek to it.
>
> One strategy for creating such a lookup table would be to generate it during reads and cache it for the next lookup. This does not fit directly into how {{IndexedDISI}} currently works (it is created anew for each invocation), but could probably be added with a little work. An advantage of this is that it does not change the underlying format and thus could be used with existing indexes.
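> A minimal sketch of what such a block jump table could look like, assuming one offset per 2^16-docID block. Names and encoding are illustrative, not the actual {{IndexedDISI}} format:
>
> {code:java}
> // Illustrative sketch of direct block addressing via a jump table.
> final class BlockJumpTable {
>   // One entry per 2^16-docID block, holding the offset of that block's
>   // data in the underlying slice. For 6M documents: 92 entries, < 1KB.
>   private final long[] blockOffsets;
>
>   BlockJumpTable(long[] blockOffsets) {
>     this.blockOffsets = blockOffsets;
>   }
>
>   // A single array lookup instead of iterating up to docID/65536 blocks,
>   // each of which costs a read in the underlying slice.
>   long offsetOfBlockContaining(int docID) {
>     return blockOffsets[docID >>> 16]; // docID / 65536 = block index
>   }
> }
> {code}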
> h2. The lookup structure inside each block
>
> If {{ALL}} of the 2^16 values are defined, the structure is empty and the ordinal is simply the requested docID with some modulo and multiply math. Nothing to improve there.
>
> If the block is {{DENSE}} (2^12 to 2^16 values are defined), a bitmap is used and the number of set bits up to the wanted index (the docID modulo the block origin) is counted. That bitmap is a long[1024], meaning that the worst case is to look up and count all set bits for 1024 longs!
>
> One known solution to this is to use a [rank structure|https://en.wikipedia.org/wiki/Succinct_data_structure]. I [implemented it|https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java] for a related project, and with that, the rank overhead for a {{DENSE}} block would be long[32] and would ensure a maximum of 9 lookups (see the first sketch after this description). It is not trivial to build the rank structure, and caching it (assuming all blocks are dense) for 6M documents would require 22 KB (3.17% overhead). It would be far better to generate the rank structure at index time and store it immediately before the bitset (this is possible with streaming as each block is fully resolved before flushing), but of course that would require a change to the codec.
>
> If {{SPARSE}} (< 2^12 values ~= 6%) are defined, the docIDs are simply in the form of a list. As a comment in the code suggests, a binary search through these would be faster (see the second sketch after this description), although that would mean seeking backwards. If that is not acceptable, I don't have any immediate idea for avoiding the full iteration.
>
> I propose implementing query-time caching of both block jumps and inner-block lookups for {{DENSE}} (using rank) as a first improvement, and an index-time {{DENSE}}-rank structure as a future improvement. As query-time caching is likely to be too costly for rapidly changing indexes, it should probably be opt-in in solrconfig.xml.
>
> h2. Some real-world observations
>
> This analysis was triggered by massive (10x) slowdown problems with both simple querying and large exports from our web archive index after upgrading from Solr 4.10 to 7.3.1. The query matching itself takes ½-2 seconds, but returning the top-10 documents takes 5-20 seconds (~50 non-stored DocValues fields), up from ½-2 seconds in total in Solr 4.10 (which had more of a mix of stored vs. DocValues fields, so the numbers might not be directly comparable).
>
> Measuring with VisualVM points to {{NIOFSIndexInput.readInternal}} as *the* hotspot. We ran some tests with simple queries on a single 307,171,504-document segment with different single-value DocValued fields in the {{fl}} and got:
>
> ||Field||Type||Docs with value||Docs w/ val %||Speed in docs/sec||
> |url|String|307,171,504|100%|12,500|
> |content_type_ext|String|224,375,378|73%|360|
> |author|String|1,506,365|0.5%|1,100|
> |crawl_date|DatePoint|307,171,498|~100%|90|
> |content_text_length|IntPoint|285,800,212|93%|410|
> |content_length|IntPoint|307,016,816|99.9%|100|
> |crawl_year|IntPoint|307,171,498|~100%|14,500|
> |last_modified|DatePoint|6,835,065|2.2%|570|
> |source_file_offset|LongPoint|307,171,504|100%|28,000|
>
> Note how both url and source_file_offset are very fast and also have a value for _all_ documents. Contrary to this, content_type_ext is very slow and crawl_date is extremely slow, and as they both have values for _nearly_ all documents, I presume they are using {{IndexedDISI#DENSE}}. last_modified is also quite slow and presumably uses {{IndexedDISI#SPARSE}}.
>
> The only mystery is crawl_year, which is also present in _nearly_ all documents but is very fast. I have no explanation for that one yet.
>
> I hope to take a stab at this around August 2018, but no promises.
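> A minimal sketch of the {{DENSE}} rank idea, assuming one cumulative popcount per 8 underlying longs (128 16-bit counters, i.e. the long[32] overhead mentioned above), which yields the stated maximum of 9 lookups. Names and packing are illustrative, not the actual {{RankCache}} layout:
>
> {code:java}
> // Illustrative rank structure over one DENSE block's bitmap.
> final class DenseRank {
>   private final long[] bitmap; // 1024 longs = 65536 bits, one DENSE block
>   private final short[] rank;  // one cumulative popcount per 8 longs:
>                                // 128 16-bit counters = long[32] of overhead
>
>   DenseRank(long[] bitmap) {
>     this.bitmap = bitmap;      // assumed length: 1024
>     this.rank = new short[bitmap.length / 8];
>     int total = 0;
>     for (int i = 0; i < bitmap.length; i++) {
>       if ((i & 7) == 0) {
>         rank[i >>> 3] = (short) total; // set bits in longs [0, i)
>       }
>       total += Long.bitCount(bitmap[i]);
>     }
>   }
>
>   // Number of set bits before 'index', where index = docID minus the block
>   // origin (0..65535). Worst case: 1 rank read + 7 full longs + 1 masked
>   // long = 9 lookups, instead of popcounting up to 1024 longs.
>   int rankBefore(int index) {
>     int word = index >>> 6;                  // the long containing the bit
>     int r = rank[word >>> 3] & 0xFFFF;       // cumulative count at group start
>     for (int i = word & ~7; i < word; i++) { // at most 7 full longs
>       r += Long.bitCount(bitmap[i]);
>     }
>     // plus the bits below 'index' inside the containing long
>     return r + Long.bitCount(bitmap[word] & ((1L << (index & 63)) - 1));
>   }
> }
> {code}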
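> And a sketch of the binary-search idea for {{SPARSE}} blocks, assuming the block's sorted docID list is available as an in-memory array (the real {{IndexedDISI}} streams it from an {{IndexInput}}, where the backwards seeking mentioned above becomes the concern):
>
> {code:java}
> import java.util.Arrays;
>
> // Illustrative only: a SPARSE block is a sorted list of docIDs, so the
> // value-ordinal of a docID is its position in that list.
> final class SparseBlock {
>   private final int[] docIDs; // sorted, < 2^12 entries
>
>   SparseBlock(int[] docIDs) {
>     this.docIDs = docIDs;
>   }
>
>   // O(log n) probes instead of iterating up to 2^12 entries.
>   // Returns -1 if the docID has no value in this block.
>   int ordinalOf(int docID) {
>     int pos = Arrays.binarySearch(docIDs, docID);
>     return pos >= 0 ? pos : -1;
>   }
> }
> {code}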