[
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853885#action_12853885
]
Toke Eskildsen commented on LUCENE-2369:
----------------------------------------
Moved from LUCENE-2335 as it really belongs here.
Lotsa devils in the details when you're poking around in the belly of Lucene,
but modulo some business with deleted documents, it looks fine for simple usage
(no parallel or multi readers). fillFields=true works just as it should by
delaying the actual term resolution until the document IDs are determined. The
current code makes it possible to create an exposed Sort quite easily:
{code}
// Locale-based comparator source backed by the exposed sort structures
ExposedFieldComparatorSource exposedFCS =
    new ExposedFieldComparatorSource(reader, new Locale("da"));
Sort sort = new Sort(new SortField("mySortField", exposedFCS));
{code}
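The resulting Sort plugs straight into a normal sorted search. A minimal sketch
(the query and field names below are placeholders and the search call is the
standard IndexSearcher API, nothing Exposed-specific):
{code}
IndexSearcher searcher = new IndexSearcher(reader);
Query query = new TermQuery(new Term("freetext", "hest"));
// Standard sorted search; with fillFields enabled the FieldDocs carry the
// resolved sort terms for the returned hits
TopFieldDocs hits = searcher.search(query, null, 50, sort);
for (ScoreDoc scoreDoc : hits.scoreDocs) {
  FieldDoc fieldDoc = (FieldDoc)scoreDoc;
  System.out.println(searcher.doc(fieldDoc.doc).get("mySortField")
      + " (sort value: " + fieldDoc.fields[0] + ")");
}
{code}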
For the curious, a modified Lucene JAR can be downloaded from
http://github.com/tokee/lucene/downloads and tested with
{code}
java -cp lucene-core-3.1-dev-LUCENE-2335-20100405.jar \
  org.apache.lucene.index.ExposedPOC expose <index> <sortField> <locale> <defaultField>
{code}
This will present the user with a simple shell where searches can be performed.
Heap usage and execution times are displayed along with the search results.
I did a little bit of real-world experimenting: a 2.5GB index with 400K
documents and 320K unique sort terms took 14 seconds to open. After that, a
locale-based sorted search that hit 250K documents and returned the first 50
took 30ms (fully warmed by re-searching 5 times). Max heap was set to 40MB, of
which 20MB was used once the sort structures had been built.
The same search using the standard locale-oriented sorter took about 1 second
to start up. After that, the 250K search took 130ms, fully warmed. Max heap was
set to 100MB.
The standard sorter was able to get by with 80MB, but execution time then
increased drastically to 2000ms, probably because of the GC overhead that the
Collator introduces by temporarily allocating two new objects for each
comparison.
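To make the difference concrete, here is a simplified sketch of the
per-comparison work (not the actual FieldComparator code): the standard
locale-based sort runs every comparison through the Collator, while the exposed
sort just compares pre-computed positions in the sorted ordinal list:
{code}
Collator collator = Collator.getInstance(new Locale("da"));

// Standard locale-based sort: each comparison goes through the Collator, which
// temporarily allocates the two objects mentioned above
int standardCompare(String termA, String termB) {
  return collator.compare(termA, termB);
}

// Exposed sort: the terms were ordered once up front, so comparing two hits is
// just comparing their positions in the pre-sorted ordinal list; no allocations
int exposedCompare(long orderA, long orderB) {
  return orderA < orderB ? -1 : orderA > orderB ? 1 : 0;
}
{code}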
The bad news is that this is quite a bit of code (400+ extra lines for the
SegmentReader alone) with several levels of indirection in the data structures.
As an example, getting the actual term for a given docID in the
ExposedFieldComparatorSource is done with
{code}
final long resolvedDocOrder = docOrder.get(order[slot]);
return resolvedDocOrder == undefinedTerm ? null : reader.getTermText(
    (int)termOrder.get((int)resolvedDocOrder));
{code}
which is not easily digested without a very thorough explanation, preferably
with a diagram.
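In lieu of a diagram, here is roughly what the involved structures mean,
modelled with plain arrays instead of the packed implementations (illustration
only, the names follow the snippet above):
{code}
// docOrder[docID]  : position of the document's sort term in locale-sorted
//                    order, or undefinedTerm if the document has no value
// termOrder[pos]   : ordinal of the term at that sorted position
// getTermText(ord) : resolves the ordinal to the term text; this is only done
//                    for the returned hits, which keeps fillFields=true cheap
String sortTermForDoc(int docID) throws IOException {
  long pos = docOrder[docID];
  return pos == undefinedTerm
      ? null : reader.getTermText((int)termOrder[(int)pos]);
}
{code}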
The API changes to the IndexReaders are the addition of two methods:
{code}
String getTermText(int ordinal) throws IOException;
{code}
is self-explanatory, but
{code}
ExposedIterator getExposedTuples(
    String persistenceKey, Comparator<Object> comparator, String field,
    boolean collectDocIDs) throws IOException;
{code}
is really hard to implement when writing a new IndexReader. My current approach
is to put the two methods in an ExposedReader interface and let the updated
IndexReaders implement that, thereby making it optional for an IndexReader to
be Exposed; see the sketch below.
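Roughly, the interface looks like this in the current experimental code (the
exact signatures may still change before a proper patch):
{code}
// Optional capability interface: only IndexReaders implementing it are Exposed
public interface ExposedReader {
  // Resolves a term ordinal to its text, e.g. when fillFields=true needs the
  // actual sort terms for the returned hits
  String getTermText(int ordinal) throws IOException;

  // Delivers the sort tuples (terms, optionally with docIDs) for the given
  // field, ordered by the supplied comparator
  ExposedIterator getExposedTuples(
      String persistenceKey, Comparator<Object> comparator, String field,
      boolean collectDocIDs) throws IOException;
}
{code}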
LUCENE-2335 seems to fulfill the promises so far, with the previously discussed
trade-offs:
* Long startup (~1 min/1M documents, less on re-open)
* Fast locale-based sorted search (supports fillFields=true and has near-zero
GC overhead)
* Very low memory overhead (both permanent and for initializing)
Regarding making a proper patch, I would like to know what I should patch
against. I use LUCENE-1990 so I need to do it against a fairly new version. I
can see that Flex is about to be merged, so I guess it would make sense to wait
for that one.
> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
> Key: LUCENE-2369
> URL: https://issues.apache.org/jira/browse/LUCENE-2369
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Toke Eskildsen
> Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache,
> which keeps all sort terms in memory. Besides the huge memory overhead,
> searching requires comparing terms with collator.compare every time,
> making searches with millions of hits fairly expensive.
> The proposed alternative implementation is to create a packed list of
> pre-sorted ordinals for the sort terms and a map from document IDs to entries
> in the sorted ordinals list. This results in very low memory overhead and
> faster sorted searches, at the cost of increased startup time. As the
> ordinals can be resolved to terms after the sorting has been performed, this
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335,
> which contains previous discussions on the subject.