A while ago I started on LUCENE-2369 with the goal of making fast and memory-efficient locale-aware sorting in Lucene. Along the way I’ve wandered quite a bit off course while exploring the possibilities of ordinals. I’ve discovered that two building blocks enable sorting, (hierarchical) faceting, index lookup and range searches in a way that supports custom ordering and has the trade-offs “slow initialization, fast execution, low memory overhead”. I would like to share my thoughts on this, and I would like to hear if and how I should move forward with it.
At the core, this is about using ordinals instead of the terms themselves — index-wide ordinals. In Lucene trunk, SegmentReaders support ordinal-to-term resolving directly, albeit not heavily optimized. Other IndexReaders do not support this out of the box, and there’s the whole problem of how to handle duplicate terms. As part of LUCENE-2369 I implemented

Building block 1: A map from custom-ordered positions to term ordinals, without duplicates.

Example: Let’s look at the terms for a given field split over 3 segments:

Segment 1: A, B, C
Segment 2: B, D
Segment 3: A, E

The index-wide ordered list of the terms is A, B, C, D, E. Using 0-based ordinals, the map from ordered positions to ordinals is (segment 1, ordinal 0), (s 1, o 1), (s 1, o 2), (s 2, o 1), (s 3, o 1).

This index-level map takes up #logical_positions*log2(#terms) bits (PackedInts to the rescue), rounded up. For the example above, this is 5*log2(7) bits. Scaling up to 10 million terms, out of which 5 million are unique, we have 5M*log2(10M) bits = 5M*24 bits = 14 MByte.

If a custom order is used, we would like to cache the calculated segment-oriented maps for faster re-opening of the index. As there are no duplicates inside a segment, the sum of the segment maps is a bit more than #terms*log2(#terms) bits. For 10M terms, this is 29 MByte extra, bringing the total memory requirement up to 43 MByte.

Building the map requires sorting the ordinals at segment level (this can be skipped if the index order is to be used) plus a merge-sort of the segment-oriented ordinals to get the index-wide ordered positions.

This map could be used for custom-order range queries by doing two binary searches and iterating between the found points. It could also be used for index lookup. Of course, one might just make a field with a custom order at index time (long live flex), so maybe this way of doing it has little value.
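To make the merge-sort concrete, here is a minimal sketch (illustrative only — not the LUCENE-2369 code; the class and method names are made up, and plain arrays stand in for PackedInts). It performs a k-way merge of per-segment sorted term lists, skipping duplicates, and records one (segment, ordinal) pair per unique term, both 0-based:

```java
import java.util.*;

public class OrderedTermMap {
    // One entry per unique term, in index-wide sorted order: which
    // segment and which in-segment ordinal resolves the term.
    public static final class Entry {
        public final int segment, ordinal;
        Entry(int segment, int ordinal) { this.segment = segment; this.ordinal = ordinal; }
    }

    // segments[s] holds the sorted terms of segment s; the ordinal of a
    // term is its index in that array. A k-way merge over segment cursors
    // visits terms in global order; duplicates are skipped, so each term
    // appears exactly once in the result.
    public static List<Entry> buildOrderedMap(String[][] segments) {
        // Each cursor is {segment, ordinal}; smallest term first,
        // ties broken by segment number.
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> {
            int cmp = segments[a[0]][a[1]].compareTo(segments[b[0]][b[1]]);
            return cmp != 0 ? cmp : Integer.compare(a[0], b[0]);
        });
        for (int s = 0; s < segments.length; s++) {
            if (segments[s].length > 0) pq.add(new int[] { s, 0 });
        }
        List<Entry> ordered = new ArrayList<>();
        String last = null;
        while (!pq.isEmpty()) {
            int[] cur = pq.poll();
            String term = segments[cur[0]][cur[1]];
            if (!term.equals(last)) {          // skip duplicate terms
                ordered.add(new Entry(cur[0], cur[1]));
                last = term;
            }
            if (cur[1] + 1 < segments[cur[0]].length) {
                pq.add(new int[] { cur[0], cur[1] + 1 });  // advance cursor
            }
        }
        return ordered;
    }
}
```

For the three-segment example above this yields the pairs (0,0), (0,1), (0,2), (1,1), (2,1) for A, B, C, D, E respectively.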
However, building block 1 is a stepping stone for

Building block 2: A map from document IDs to ordered positions for a given field.

Building this map is in principle a matter of stepping through the terms from building block 1 and requesting all the document IDs for each term from all segments. Some memory optimization can be done if there is only 1 term/document, but the general case is handled by keeping an array of entry points for all document IDs into an array of ordered term positions for the document IDs.

The memory requirement for this map is #documents*log2(#documents) + #term_references*log2(#logical_positions) bits. For 20M documents with 40M references to 5M unique terms (aka ordered positions) this is 20M*log2(20M) + 40M*log2(5M) bits = 20M*25 + 40M*23 bits = 169 MByte. On top of that comes the memory requirement for building block 1.

With building blocks 1 + 2 we can implement the rest of the functionality:

Sorting is trivial: The order of two documents is determined by comparing the first integer from building block 2 for each document. We could also specialize building block 2 to exactly 1 term/document to save some memory and gain a little more speed.

Faceting is simple: A counting array (the obvious choice is int[] or long[]) of length #ordered_positions from building block 1 is created. The document IDs of all hits from a search are used to get the ordered positions from building block 2, and the corresponding entries in the counting array are incremented. Extracting the tags by count is done by iterating the counting array with a priority queue, finding the top-X tag ordinals, then extracting the terms themselves with the map from building block 1. Extracting by order is done by iterating and collecting the indexes of all entries where the count is > x (typically 1). As far as I understand, this is more or less the approach used by Solr’s field cache faceting.

Hierarchical faceting is more complex.
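The sorting comparison and the faceting count can be sketched as follows (again illustrative, with invented names; the real code in LUCENE-2369 would use PackedInts rather than plain int[]). The entry-points array delimits each document’s slice of ordered term positions, exactly as described above:

```java
import java.util.*;

public class DocTermPositions {
    // entry[d]..entry[d+1] delimits doc d's slice in positions[], which
    // holds ordered term positions from building block 1.
    private final int[] entry;      // length #documents + 1
    private final int[] positions;  // length #term_references

    public DocTermPositions(int[] entry, int[] positions) {
        this.entry = entry;
        this.positions = positions;
    }

    // Sorting: order two documents by the first ordered position of each.
    public int compareDocs(int docA, int docB) {
        return Integer.compare(positions[entry[docA]], positions[entry[docB]]);
    }

    // Faceting: one counter per ordered position, incremented once for
    // every term reference in every hit document.
    public int[] countFacets(int[] hitDocs, int numOrderedPositions) {
        int[] counts = new int[numOrderedPositions];
        for (int doc : hitDocs) {
            for (int i = entry[doc]; i < entry[doc + 1]; i++) {
                counts[positions[i]]++;
            }
        }
        return counts;
    }

    // Top-X extraction: a min-heap of (count, position) keeps the X
    // largest counts while iterating the counting array once.
    public static int[] topPositions(int[] counts, int topX) {
        PriorityQueue<int[]> pq =
            new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        for (int pos = 0; pos < counts.length; pos++) {
            if (pq.size() < topX) {
                pq.add(new int[] { counts[pos], pos });
            } else if (counts[pos] > pq.peek()[0]) {
                pq.poll();
                pq.add(new int[] { counts[pos], pos });
            }
        }
        int[] top = new int[pq.size()];
        for (int i = top.length - 1; i >= 0; i--) top[i] = pq.poll()[1];
        return top;  // positions ordered by descending count
    }
}
```

The returned top positions would then be resolved to terms through the map from building block 1.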
Let’s leave it for another post and just say that augmenting building block 1 with about 1 byte/logical term gives us what we want.

The code as well as unit tests are in LUCENE-2369 for the curious. There is working code that we use in-house with a test setup. It needs cleanup and polishing, but it does not require changes to existing Lucene code. It could be part of Lucene core (for sorting and range search at least), it could be a contrib, or it might belong in Solr (SOLR-64 or maybe SOLR-792), Bobo-Browse or somewhere else. My personal preference is to make it a Lucene contrib so that the maximum number of projects can benefit, but I would very much like to hear some thoughts on this.

Thank you for listening,
Toke Eskildsen
