A while ago I started on LUCENE-2369 with the goal of making locale-aware 
sorting in Lucene fast and memory-efficient. Along the way I’ve wandered 
quite a bit off course while exploring the possibilities of ordinals. I’ve 
discovered that two building blocks enable sorting, (hierarchical) faceting, 
index lookup and range searches in a way that supports custom ordering and 
has the trade-offs “slow initialization, fast execution, low memory 
overhead”. I would like to share my thoughts on this and to hear if and how 
I should move forward with it.

At the core, this is about using ordinals instead of the terms themselves: 
index-wide ordinals. In Lucene trunk, SegmentReaders support ordinal-to-term 
resolving directly, albeit not in a highly optimized way. Other IndexReaders 
do not support this out of the box, and there is the whole problem of how to 
handle duplicate terms. As part of LUCENE-2369 I implemented

Building block 1: A map from custom ordered positions to term ordinals, 
without duplicates. Example: let’s look at the terms for a given field split 
over 3 segments:

Segment 1: A, B, C
Segment 2: B, D
Segment 3: A, E

The index-wide ordered list of the terms would be A, B, C, D, E. The map 
from ordered positions to ordinals would be (segment 1, ordinal 0), 
(s1, o1), (s1, o2), (s2, o1), (s3, o1).
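
To make the structure concrete, here is a rough, self-contained sketch of 
how such a map could be built by merging the sorted per-segment terms. The 
class name and the String[][] input are made up for illustration; the real 
thing would work on the segments’ term enumerations and keep the result in 
PackedInts:

  import java.util.*;

  // Merge the (already sorted) terms from each segment into one index-wide
  // ordered list without duplicates, remembering for every ordered position
  // which (segment, ordinal) pair it came from.
  public class OrderedTermMapSketch {
    public static void main(String[] args) {
      String[][] segments = {
          {"A", "B", "C"},  // segment 1
          {"B", "D"},       // segment 2
          {"A", "E"}        // segment 3
      };

      // One cursor {segment, ordinal} per segment, ordered by current term.
      PriorityQueue<int[]> queue = new PriorityQueue<>(
          Comparator.comparing((int[] c) -> segments[c[0]][c[1]]));
      for (int s = 0; s < segments.length; s++) {
        if (segments[s].length > 0) {
          queue.add(new int[]{s, 0});
        }
      }

      List<int[]> orderedToOrdinal = new ArrayList<>(); // position -> (segment, ordinal)
      String previous = null;
      while (!queue.isEmpty()) {
        int[] cursor = queue.poll();
        String term = segments[cursor[0]][cursor[1]];
        if (!term.equals(previous)) {       // skip duplicates across segments
          orderedToOrdinal.add(cursor);
          previous = term;
        }
        if (cursor[1] + 1 < segments[cursor[0]].length) {
          queue.add(new int[]{cursor[0], cursor[1] + 1});
        }
      }

      for (int pos = 0; pos < orderedToOrdinal.size(); pos++) {
        int[] so = orderedToOrdinal.get(pos);
        System.out.println("position " + pos + " -> (segment " + (so[0] + 1)
            + ", ordinal " + so[1] + ") = " + segments[so[0]][so[1]]);
      }
    }
  }

For the 3-segment example this prints exactly the five (segment, ordinal) 
pairs listed above.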

This index-level map takes up #logical_positions*log2(#terms) bits 
(PackedInts to the rescue), with log2 rounded up. For the example above, 
this is 5*log2(7) bits. Scaling up to 10 million terms, out of which 
5 million are unique, we have 5M*log2(10M) bits = 5M*24 bits = 14 MByte.
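
As a quick sanity check of that estimate, here is the arithmetic in 
compilable form (the numbers are the assumed 10M/5M from above, the class 
name is made up):

  // Back-of-the-envelope check: 5M ordered positions, each entry pointing
  // into a space of 10M terms, packed at ceil(log2(#terms)) bits per value.
  public class MemoryEstimateSketch {
    public static void main(String[] args) {
      long positions = 5_000_000L;  // #logical_positions (unique terms)
      long terms = 10_000_000L;     // #terms across all segments
      int bitsPerValue = 64 - Long.numberOfLeadingZeros(terms - 1); // ceil(log2) = 24
      long bytes = positions * bitsPerValue / 8;
      System.out.println(bitsPerValue + " bits/value -> ~"
          + bytes / (1024 * 1024) + " MByte");  // 24 bits/value -> ~14 MByte
    }
  }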

If a custom order is used, we would like to cache the calculated 
segment-oriented maps for faster re-opening of the index. As there are no 
duplicates inside a segment, the sum of the segment maps will be a bit more 
than #terms*log2(#terms) bits. For 10M terms, this is 29 MByte extra, 
bringing the total memory requirement up to 43 MByte.

Building the map requires sorting the ordinals at segment level (this can be 
skipped if the index order is to be used) plus a merge-sort of the 
segment-oriented ordinals to get the index-wide ordered positions. The map 
could be used for custom-order range queries by doing two binary searches 
and iterating between the two points, and it could also be used for index 
lookup. Of course, one might just make a field with a custom order at index 
time (long live flex), so maybe this way of doing it has little value on its 
own. However, the building block is first and foremost a stepping stone for 
building block 2.
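
First, though, a rough sketch of such a custom-order range lookup. The names 
are made up: termAt stands in for resolving an ordered position to its term 
through the (segment, ordinal) map, and the comparator stands in for the 
custom order (e.g. a Collator):

  import java.util.Comparator;
  import java.util.function.IntFunction;

  // Two binary searches over the ordered positions, then iteration between
  // the two points.
  public class OrderedRangeSketch {

    // Smallest ordered position whose term is >= key (count if there is none).
    static int lowerBound(IntFunction<String> termAt, int count, String key,
                          Comparator<String> order) {
      int low = 0, high = count;
      while (low < high) {
        int mid = (low + high) >>> 1;
        if (order.compare(termAt.apply(mid), key) < 0) {
          low = mid + 1;
        } else {
          high = mid;
        }
      }
      return low;
    }

    public static void main(String[] args) {
      String[] orderedTerms = {"A", "B", "C", "D", "E"}; // stand-in for block 1
      IntFunction<String> termAt = pos -> orderedTerms[pos];
      Comparator<String> order = Comparator.naturalOrder();

      int from = lowerBound(termAt, orderedTerms.length, "B", order);
      int to = lowerBound(termAt, orderedTerms.length, "E", order); // exclusive
      for (int pos = from; pos < to; pos++) {
        System.out.println("in range: " + termAt.apply(pos)); // B, C, D
      }
    }
  }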

Building block 2: A map from document IDs to ordered positions for a given 
field. Building this map is, in principle, a matter of stepping through the 
terms from building block 1 and requesting all the document IDs for those 
terms from all segments. Some memory optimization can be done if there is 
only 1 term/document, but the general case is handled by keeping an array of 
entry points for all document IDs into an array of ordered term positions.
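
In code, the general case could look roughly like this; the int[] arrays are 
made-up stand-ins for what would really be packed structures:

  import java.util.Arrays;

  // Building block 2, multi-value case: the ordered positions for all
  // documents are concatenated into one array, and a second array holds each
  // document's entry point into it.
  public class DocToPositionsSketch {
    final int[] entryPoints; // docID -> offset into positions (length #docs + 1)
    final int[] positions;   // ordered positions, concatenated per document

    DocToPositionsSketch(int[] entryPoints, int[] positions) {
      this.entryPoints = entryPoints;
      this.positions = positions;
    }

    int[] positionsForDoc(int docID) {
      return Arrays.copyOfRange(positions, entryPoints[docID], entryPoints[docID + 1]);
    }

    public static void main(String[] args) {
      // doc 0 -> positions {0, 3}, doc 1 -> {1}, doc 2 -> {2, 4}
      DocToPositionsSketch map = new DocToPositionsSketch(
          new int[]{0, 2, 3, 5}, new int[]{0, 3, 1, 2, 4});
      System.out.println(Arrays.toString(map.positionsForDoc(2))); // [2, 4]
    }
  }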

Memory requirements for this map are #documents*log2(#documents) + 
#term_references*log2(#logical_terms) bits. For 20M documents with 40M 
references to 5M unique terms (aka ordered positions) this is 
20M*log2(20M) + 40M*log2(5M) bits = 20M*25 + 40M*23 bits = 1,420 Mbit, or 
roughly 170 MByte. On top of that come the memory requirements for building 
block 1.

With building blocks 1 + 2 we can implement the rest of the functionality:

Sorting is trivial: The order of two documents is determined by comparing the 
first integer from building block 2 for each document. We could also specialize 
building block 2 to exactly 1 term/document to save some memory and gain a 
little more speed.
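
As a tiny illustration of that specialized single-value case (the array 
below is a made-up stand-in for the packed docID -> position structure):

  import java.util.Arrays;
  import java.util.Comparator;

  // Sorting documents by comparing their ordered positions directly.
  public class OrdinalSortSketch {
    public static void main(String[] args) {
      int[] docToPosition = {3, 0, 2, 1}; // docID -> ordered position
      Integer[] docIDs = {0, 1, 2, 3};
      Arrays.sort(docIDs, Comparator.comparingInt((Integer doc) -> docToPosition[doc]));
      System.out.println(Arrays.toString(docIDs)); // [1, 3, 2, 0]
    }
  }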

Faceting is simple: a counting array (the obvious choice is an int[] or 
long[]) of length “#ordered_positions from building block 1” is created. The 
document IDs for all the hits from a search are used to get the ordered 
positions from building block 2, and the corresponding entries in the 
counting array are incremented. Extracting the tags by count is done by 
iterating the counting array with a priority queue, finding the ordinals of 
the top-X tags, then extracting the terms themselves with the map from 
building block 1. Extracting by order is done by iterating and collecting 
the indexes of all entries where the count is > x (typically 1). As far as I 
understand, this is more or less the approach used by Solr’s field cache 
faceting.
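
A small self-contained sketch of the counting approach, again with made-up 
stand-ins for the two building blocks:

  import java.util.PriorityQueue;

  // Count ordered positions for the hits, keep the top-X positions with a
  // priority queue and resolve them to terms through building block 1.
  public class FacetCountSketch {
    public static void main(String[] args) {
      String[] orderedTerms = {"A", "B", "C", "D", "E"};     // block 1 stand-in
      int[][] docToPositions = {{0, 3}, {1}, {0}, {3}, {3}}; // block 2 stand-in
      int[] hits = {0, 1, 2, 3, 4};                          // docIDs from a search
      int topX = 2;

      int[] counts = new int[orderedTerms.length];
      for (int docID : hits) {
        for (int position : docToPositions[docID]) {
          counts[position]++;
        }
      }

      // Queue of {position, count}; the head is always the lowest count, so
      // the queue never holds more than the topX best candidates.
      PriorityQueue<int[]> top = new PriorityQueue<>((a, b) -> a[1] - b[1]);
      for (int position = 0; position < counts.length; position++) {
        if (counts[position] == 0) {
          continue;
        }
        top.add(new int[]{position, counts[position]});
        if (top.size() > topX) {
          top.poll();
        }
      }
      while (!top.isEmpty()) { // printed in ascending order of count
        int[] entry = top.poll();
        System.out.println(orderedTerms[entry[0]] + " (" + entry[1] + ")");
      }
    }
  }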

Hierarchical faceting is more complex. Let’s leave it for another post and 
just say that augmenting building block 1 with about 1 byte/logical term 
gives us what we want. The code as well as unit tests are in LUCENE-2369 for 
the curious.


There is working code that we use in-house with a test setup. It needs 
cleanup and polishing, but it does not require changes to existing Lucene 
code. It could be part of Lucene core (for sorting and range search at 
least), it could be a contrib, or it might belong in Solr (SOLR-64 or maybe 
SOLR-792), Bobo-browse or somewhere else. My personal preference is a Lucene 
contrib, so that the maximum number of projects can benefit, but I would 
very much like to hear some thoughts on this.

Thank you for listening,
Toke Eskildsen