[
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905507#action_12905507
]
Toke Eskildsen edited comment on LUCENE-2369 at 9/2/10 11:14 AM:
-----------------------------------------------------------------
{quote}
They are just regular terms! you can do a TermQuery on them, sort them as
byte[], etc.
its just the bytes use 'collation encoding' instead of 'utf-8 encoding'.
{quote}
Yes, you have stated that repeatedly.
As you are unwilling to answer my questions, I've tweaked my tests to see for
myself. That worked very well, as I learned something new. To answer one of the
questions: yes, the terms are loaded into memory when a sort is performed
on a field. This is the case for the locale-based sort (yes, I have understood
that you do not consider that a viable form of sorting; I only mention it for
completeness), for STRING sorting (natural order with ordinal-based speedup)
and for STRING_VAL sorting (natural order without the ordinal speedup). All of
this was against Lucene trunk, no patches.
{quote}
This is why i want to factor out the whole 'locale' thing from the issue, since
sorting is agnostic to whats in the byte[], its unrelated and it would simplify
the issue to just discuss that.
{quote}
For my new tests I've switched to natural order sorting (direct byte[]
comparison, i.e. a Lucene STRING-sorted search without specifying a Locale).
The tests should be fairly telling for the different scenarios, as the ICU
collation keys should be about the same size as the original terms.
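To make the "collation keys are just byte[] terms" point concrete, here is a small sketch. It uses the JDK Collator as a stand-in for ICU, and the Danish locale and terms are merely illustrative: once terms are encoded as collation keys, plain unsigned byte comparison yields the locale's order.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeyDemo {
    // Encode a term as a locale-aware sort key; comparing the raw key bytes
    // then gives the same order as Collator.compare on the original strings.
    static byte[] sortKey(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    public static void main(String[] args) {
        Collator da = Collator.getInstance(new Locale("da", "DK"));
        byte[] a = sortKey(da, "ål");
        byte[] b = sortKey(da, "zebra");
        // In Danish, 'å' sorts after 'z', unlike in raw UTF-8 byte order.
        System.out.println(Arrays.compareUnsigned(a, b) > 0);
    }
}
```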
{quote}
* The apis here are wrong anyway: it shouldnt take Locale but Collator.
There is no way to set strength or any other options, and theres no way
to supply a Collator i made myself (e.g. from RuleBasedCollator)
{quote}
I fully agree. The code I've made so far takes a Comparator<BytesRef>, with an
optimization for wrapped Collators. The title of LUCENE-2369 is not technically
correct, but was chosen because "Locale" is a fairly well-known concept while
"Collator" is more complex. That might have been a mistake.
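As a sketch of what accepting a comparator (rather than a Locale) buys, consider the following; a plain Comparator<String> stands in for the Comparator<BytesRef> here, since BytesRef is Lucene-specific. Any pre-configured Collator can be wrapped, including one with a non-default strength or a RuleBasedCollator, which is exactly what a Locale-only API cannot express.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

public class ComparatorWrap {
    // Taking a Comparator lets callers pass in any configured Collator,
    // e.g. with a custom strength, or a RuleBasedCollator of their own.
    static Comparator<String> fromCollator(Collator collator) {
        return collator::compare;
    }

    public static void main(String[] args) {
        Collator primary = Collator.getInstance(Locale.ENGLISH);
        primary.setStrength(Collator.PRIMARY); // ignore case and accent differences
        Comparator<String> cmp = fromCollator(primary);
        System.out.println(cmp.compare("resume", "Résumé") == 0);
    }
}
```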
Onwards to testing with natural order (new Sort(new SortField(myField,
SortField.STRING))) for Lucene and the hybrid approach (natural order +
pre-sorting) for exposed. No ZIPping in the background this time, so the
measurements differ from the previous test. Heap sizes were measured after a
call to System.gc().
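For reference, a minimal sketch of that kind of heap measurement, using the usual Runtime-based approximation (note that System.gc() is only a hint to the JVM, so the resulting figures are approximate):

```java
public class HeapMeasure {
    // Approximate used-heap snapshot: request a GC, then read total - free.
    // A GC request is only a hint, so treat the result as a rough figure.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.out.printf("Used heap: %.1f MB%n", usedHeapBytes() / (1024.0 * 1024.0));
    }
}
```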
2M document index, search hits 1M documents, top 10 hits extracted:
* Initial exposed search: 0:16 minutes
* Subsequent exposed searches: 40 ms
* Total heap usage for Lucene + exposed structure: 20 MB
* Initial default Lucene search: 0.8 s
* Subsequent default Lucene searches: 25 ms
* Total heap usage for Lucene + field cache: 60 MB
20M document index, search hits 10M documents, top 10 hits extracted:
* Initial exposed search: 2:53 minutes
* Subsequent exposed searches: 330 ms
* Total heap usage for Lucene + exposed structure: 154 MB
* Initial default Lucene search: 6 s
* Subsequent default Lucene searches: 220 ms
* Total heap usage for Lucene + field cache: 600 MB
200M document index, search hits 100M documents, top 10 hits extracted:
* Initial exposed search: 31:33 minutes
* Subsequent exposed searches: 3200 ms
* Total heap usage for Lucene + exposed structure: 1660 MB
* No data for default Lucene search as there was OOM with 7 GB of heap.
What did we learn from this?
* Natural order search in Lucene with STRING is very fast (as most of the
work is ordinal comparison).
* Exposed sorting is actually slower than natural order (that's news to
me). The culprit is a modified PackedInts structure. I'll look into that.
* The exposed structure build penalty for the hybrid approach (i.e. relying
on natural order instead of doing an explicit sort) was indeed markedly lower
than for exposed with explicit sorting: a factor of 5. I would have expected it
to be more, though.
* The hybrid approach uses less than a third of the RAM required
by Lucene natural order sorting.
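For readers following along, the ordinal idea behind these structures can be sketched as follows. This is a toy illustration, not the actual patch code: the distinct terms are sorted once with the (possibly expensive) comparator, each document keeps only an ordinal, and per-hit comparisons during search become plain int comparisons.

```java
import java.util.Arrays;
import java.util.Comparator;

public class OrdinalSortSketch {
    // Sort documents by their term, comparing only small ints per hit.
    // terms: term index -> term string; docToTerm: docID -> term index.
    static Integer[] sortDocsByTerm(String[] terms, int[] docToTerm) {
        Integer[] ord = new Integer[terms.length];
        for (int i = 0; i < ord.length; i++) ord[i] = i;
        // One-time, potentially expensive sort of the distinct terms.
        Arrays.sort(ord, Comparator.comparing((Integer i) -> terms[i]));
        int[] rank = new int[terms.length]; // term index -> sorted position
        for (int pos = 0; pos < ord.length; pos++) rank[ord[pos]] = pos;

        Integer[] docs = new Integer[docToTerm.length];
        for (int d = 0; d < docs.length; d++) docs[d] = d;
        // The per-hit comparison is now just an int comparison; the terms
        // can still be resolved afterwards (fillFields-style) via ord.
        Arrays.sort(docs, Comparator.comparingInt((Integer d) -> rank[docToTerm[d]]));
        return docs;
    }

    public static void main(String[] args) {
        Integer[] sorted = sortDocsByTerm(
            new String[] {"banana", "apple", "cherry"}, // term index -> term
            new int[] {2, 0, 1, 1});                    // docID -> term index
        System.out.println(Arrays.toString(sorted));    // → [2, 3, 1, 0]
    }
}
```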
So, Robert, does this answer your challenge "if you have a way to improve
memory usage for byte[] terms, lets look just at that?"?
> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
> Key: LUCENE-2369
> URL: https://issues.apache.org/jira/browse/LUCENE-2369
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Toke Eskildsen
> Priority: Minor
> Attachments: LUCENE-2369.patch
>
>
> The current implementation of locale-based sort in Lucene uses the FieldCache
> which keeps all sort terms in memory. Beside the huge memory overhead,
> searching requires comparison of terms with collator.compare every time,
> making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of
> pre-sorted ordinals for the sort terms and a map from document-IDs to entries
> in the sorted ordinals list. This results in very low memory overhead and
> faster sorted searches, at the cost of increased startup-time. As the
> ordinals can be resolved to terms after the sorting has been performed, this
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335
> which contains previous discussions on the subject.
--
This message is automatically generated by JIRA.