[
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905507#action_12905507
]
Toke Eskildsen edited comment on LUCENE-2369 at 9/2/10 11:14 AM:
-----------------------------------------------------------------
{quote}
They are just regular terms! you can do a TermQuery on them, sort them as
byte[], etc.
its just the bytes use 'collation encoding' instead of 'utf-8 encoding'.
{quote}
Yes, you have stated that repeatedly.
As you are unwilling to answer my questions, I've tweaked my tests to see for
myself. That worked very well, as I learned something new. To answer one of the
questions: yes, the terms are loaded into memory when a sort is performed
on a field. This is the case for the locale-based sort (yes, I have understood
that you do not consider that a viable form of sorting; I only mention it for
completeness), for STRING sorting (natural order with ordinal-based speedup)
and for STRING_VAL sorting (natural order without the ordinal speedup). All of
this was against Lucene trunk, no patches.
{quote}
This is why i want to factor out the whole 'locale' thing from the issue, since
sorting is agnostic to whats in the byte[], its unrelated and it would simplify
the issue to just discuss that.
{quote}
For my new tests I've switched to natural order sorting (direct byte[]
comparison, i.e. a Lucene STRING-sorted search without specifying a Locale).
The tests should be fairly telling for the different scenarios, as the ICU
collation keys should be about the same size as the original terms.
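To make the "collation keys are just byte[] terms" point concrete, here is a small sketch. It uses the JDK Collator as a stand-in for ICU, and the Danish locale and terms are merely illustrative: once terms are encoded as collation keys, plain unsigned byte comparison yields the locale's order.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeyDemo {
    // Encode a term as a locale-aware sort key; comparing the raw key bytes
    // then gives the same order as Collator.compare on the original strings.
    static byte[] sortKey(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    public static void main(String[] args) {
        Collator da = Collator.getInstance(new Locale("da", "DK"));
        byte[] a = sortKey(da, "ål");
        byte[] b = sortKey(da, "zebra");
        // In Danish, 'å' sorts after 'z', unlike in raw UTF-8 byte order.
        System.out.println(Arrays.compareUnsigned(a, b) > 0);
    }
}
```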
{quote}
* The apis here are wrong anyway: it shouldnt take Locale but Collator.
There is no way to set strength or any other options, and theres no way
to supply a Collator i made myself (e.g. from RuleBasedCollator)
{quote}
I fully agree. The code I've made so far takes a Comparator<BytesRef>, with an
optimization for wrapped Collators. The title of LUCENE-2369 is not technically
correct, but was chosen because "Locale" is a fairly well-known concept while
"Collator" is more complex. That might have been a mistake.
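As a sketch of what accepting a comparator (rather than a Locale) buys, consider the following; a plain Comparator<String> stands in for the Comparator<BytesRef> here, since BytesRef is Lucene-specific. Any pre-configured Collator can be wrapped, including one with a non-default strength or a RuleBasedCollator, which is exactly what a Locale-only API cannot express.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

public class ComparatorWrap {
    // Taking a Comparator lets callers pass in any configured Collator,
    // e.g. with a custom strength, or a RuleBasedCollator of their own.
    static Comparator<String> fromCollator(Collator collator) {
        return collator::compare;
    }

    public static void main(String[] args) {
        Collator primary = Collator.getInstance(Locale.ENGLISH);
        primary.setStrength(Collator.PRIMARY); // ignore case and accent differences
        Comparator<String> cmp = fromCollator(primary);
        System.out.println(cmp.compare("resume", "Résumé") == 0);
    }
}
```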
Onwards to testing with natural order (new Sort(new SortField(myField,
SortField.STRING))) for Lucene and the hybrid approach (natural order +
pre-sorting) for exposed. No ZIPping in the background this time, so the
measurements differ from the previous test. Heap sizes were measured after a
call to System.gc().
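For reference, a minimal sketch of that kind of heap measurement, using the usual Runtime-based approximation (note that System.gc() is only a hint to the JVM, so the resulting figures are approximate):

```java
public class HeapMeasure {
    // Approximate used-heap snapshot: request a GC, then read total - free.
    // A GC request is only a hint, so treat the result as a rough figure.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.out.printf("Used heap: %.1f MB%n", usedHeapBytes() / (1024.0 * 1024.0));
    }
}
```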
2M document index, search hits 1M documents, top 10 hits extracted:
* Initial exposed search: 0:16 minutes
* Subsequent exposed searches: 40 ms
* Total heap usage for Lucene + exposed structure: 20 MB
* Initial default Lucene search: 0.8 s
* Subsequent default Lucene searches: 25 ms
* Total heap usage for Lucene + field cache: 60 MB
20M document index, search hits 10M documents, top 10 hits extracted:
* Initial exposed search: 2:53 minutes
* Subsequent exposed searches: 330 ms
* Total heap usage for Lucene + exposed structure: 154 MB
* Initial default Lucene search: 6 s
* Subsequent default Lucene searches: 220 ms
* Total heap usage for Lucene + field cache: 600 MB
200M document index, search hits 100M documents, top 10 hits extracted:
* Initial exposed search: 31:33 minutes
* Subsequent exposed searches: 3200 ms
* Total heap usage for Lucene + exposed structure: 1660 MB
* No data for default Lucene search as there was OOM with 7 GB of heap.
What did we learn from this?
* Natural order search in Lucene with STRING is very fast (as most of the
work is ordinal comparison).
* Exposed sorting is actually slower than natural order (that's news to
me). The culprit is a modified PackedInts structure. I'll look into that.
* The exposed structure build penalty for the hybrid approach (i.e. relying
on natural order instead of doing an explicit sort) was indeed markedly lower
than for exposed with explicit sorting: a factor of 5. I would have expected it
to be more, though.
* The hybrid approach uses less than a third of the RAM required
by Lucene natural order sorting.
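For readers following along, the ordinal idea behind these structures can be sketched as follows. This is a toy illustration, not the actual patch code: the distinct terms are sorted once with the (possibly expensive) comparator, each document keeps only an ordinal, and per-hit comparisons during search become plain int comparisons.

```java
import java.util.Arrays;
import java.util.Comparator;

public class OrdinalSortSketch {
    // Sort documents by their term, comparing only small ints per hit.
    // terms: term index -> term string; docToTerm: docID -> term index.
    static Integer[] sortDocsByTerm(String[] terms, int[] docToTerm) {
        Integer[] ord = new Integer[terms.length];
        for (int i = 0; i < ord.length; i++) ord[i] = i;
        // One-time, potentially expensive sort of the distinct terms.
        Arrays.sort(ord, Comparator.comparing((Integer i) -> terms[i]));
        int[] rank = new int[terms.length]; // term index -> sorted position
        for (int pos = 0; pos < ord.length; pos++) rank[ord[pos]] = pos;

        Integer[] docs = new Integer[docToTerm.length];
        for (int d = 0; d < docs.length; d++) docs[d] = d;
        // The per-hit comparison is now just an int comparison; the terms
        // can still be resolved afterwards (fillFields-style) via ord.
        Arrays.sort(docs, Comparator.comparingInt((Integer d) -> rank[docToTerm[d]]));
        return docs;
    }

    public static void main(String[] args) {
        Integer[] sorted = sortDocsByTerm(
            new String[] {"banana", "apple", "cherry"}, // term index -> term
            new int[] {2, 0, 1, 1});                    // docID -> term index
        System.out.println(Arrays.toString(sorted));    // → [2, 3, 1, 0]
    }
}
```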
So, Robert, does this answer your challenge "if you have a way to improve
memory usage for byte[] terms, lets look just at that?"?
> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
> Key: LUCENE-2369
> URL: https://issues.apache.org/jira/browse/LUCENE-2369
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Toke Eskildsen
> Priority: Minor
> Attachments: LUCENE-2369.patch
>
>
> The current implementation of locale-based sort in Lucene uses the FieldCache
> which keeps all sort terms in memory. Beside the huge memory overhead,
> searching requires comparison of terms with collator.compare every time,
> making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of
> pre-sorted ordinals for the sort terms and a map from document-IDs to entries
> in the sorted ordinals list. This results in very low memory overhead and
> faster sorted searches, at the cost of increased startup-time. As the
> ordinals can be resolved to terms after the sorting has been performed, this
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335
> which contains previous discussions on the subject.
--
This message is automatically generated by JIRA.