[jira] [Comment Edited] (SOLR-9599) DocValues performance regression with new iterator API

Joel Bernstein (JIRA) Sun, 05 Feb 2017 07:31:56 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853248#comment-15853248
 ]


Joel Bernstein edited comment on SOLR-9599 at 2/5/17 3:31 PM:
--------------------------------------------------------------

I've started performance testing the new iterator API as part of the Apache 
Calcite integration. In particular I've been testing the ExportWriter's sort 
and export performance with String fields. So far the performance numbers have 
been comparable to the random access API. 
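
For readers unfamiliar with the two access styles being compared, here is a toy sketch in plain Java. The interfaces `RandomAccessOrds` and `OrdIterator` are hypothetical stand-ins written for illustration, not Lucene's actual classes, and the counting loops only approximate what a docValues consumer does.

```java
import java.util.Arrays;

// Toy model only: these interfaces are hypothetical stand-ins, not Lucene classes.
class DocValuesAccessSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Random-access style: ask for the ordinal of any document, in any order. */
    interface RandomAccessOrds {
        int getOrd(int docId); // -1 means the document has no value
    }

    /** Iterator style: documents that have a value are visited in increasing docId order. */
    interface OrdIterator {
        int nextDoc();  // next docId with a value, or NO_MORE_DOCS when exhausted
        int ordValue(); // ordinal of the document the iterator is positioned on
    }

    static RandomAccessOrds randomAccess(int[] ords) {
        return docId -> ords[docId];
    }

    static OrdIterator iterator(int[] ords) {
        return new OrdIterator() {
            private int doc = -1;

            @Override
            public int nextDoc() {
                if (doc == NO_MORE_DOCS) {
                    return NO_MORE_DOCS;
                }
                doc++;
                // Skip documents without a value.
                while (doc < ords.length && ords[doc] < 0) {
                    doc++;
                }
                if (doc >= ords.length) {
                    doc = NO_MORE_DOCS;
                }
                return doc;
            }

            @Override
            public int ordValue() {
                return ords[doc];
            }
        };
    }

    /** Counts documents per ordinal by probing every docId. */
    static int[] countWithRandomAccess(RandomAccessOrds values, int maxDoc, int numOrds) {
        int[] counts = new int[numOrds];
        for (int doc = 0; doc < maxDoc; doc++) {
            int ord = values.getOrd(doc);
            if (ord >= 0) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Counts documents per ordinal by walking the iterator forward. */
    static int[] countWithIterator(OrdIterator values, int numOrds) {
        int[] counts = new int[numOrds];
        for (int doc = values.nextDoc(); doc != NO_MORE_DOCS; doc = values.nextDoc()) {
            counts[values.ordValue()]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] ords = {2, 0, -1, 1, 2}; // made-up data: document 2 has no value
        System.out.println(Arrays.toString(countWithRandomAccess(randomAccess(ords), ords.length, 3)));
        System.out.println(Arrays.toString(countWithIterator(iterator(ords), 3)));
    }
}
```

Both loops produce the same counts. The difference is that the iterator version can only move forward through the documents, while the random-access version can look up any document at any time.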

The types of Streaming Expressions I've been running look like this:

{code}
null(search(enron, q="*:*", fl="to", sort="to desc", qt="/export"))
{code}

This will export and sort all the values in the "to" field in the enron email 
data set. The *null* function simply drops all the tuples so we fully isolate 
the performance of the /export.
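
As a rough sketch of how such an expression could be submitted and timed from Java: the code below URL-encodes the expression for the /stream handler's `expr` parameter and provides a small wall-clock timer. The base URL `http://localhost:8983/solr`, the class name, and the helper names are assumptions for illustration. This is not the harness used for the numbers in this comment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch only: host, port, and helper names are assumptions, not taken from the test setup.
class StreamRequestSketch {

    /** Builds a /stream request URL with the expression passed in the expr parameter. */
    static String streamUrl(String baseUrl, String collection, String expr) {
        try {
            return baseUrl + "/" + collection + "/stream?expr=" + URLEncoder.encode(expr, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the wall-clock time of one run of the task, in milliseconds. */
    static long elapsedMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        String expr = "null(search(enron, q=\"*:*\", fl=\"to\", sort=\"to desc\", qt=\"/export\"))";
        String url = streamUrl("http://localhost:8983/solr", "enron", expr);
        System.out.println(url);
        // To time a real run, wrap an HTTP GET of this URL in elapsedMillis(...)
        // against a running Solr instance and read the response to the end.
    }
}
```

A single timed run is noisy, so repeating the request several times after a warm-up and comparing the results is a safer way to read the numbers.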

I've been using the Direct docValues format for the test, but I'll reindex 
with the default docValues format this week. Typically, though, my advice to 
anyone who wants to maximize streaming performance is to use Direct docValues.

Currently both the old and new docValues APIs perform this operation in 400 
ms.

I'll continue to update this thread as I test numerics and different docValues 
formats, and also increase the size of indexes.

I'll also be testing the performance of field collapsing.




> DocValues performance regression with new iterator API
> ------------------------------------------------------
>
>                 Key: SOLR-9599
>                 URL: https://issues.apache.org/jira/browse/SOLR-9599
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: master (7.0)
>            Reporter: Yonik Seeley
>             Fix For: master (7.0)
>
>
> I did a quick performance comparison of faceting indexed fields (i.e. 
> docvalues are not stored) using method=dv before and after the new docvalues 
> iterator went in (LUCENE-7407).
> 5M document index, 21 segments, single valued string fields w/ no missing 
> values.
> || field cardinality || new_time / old_time ||
> |10|2.01|
> |1000|2.02|
> |10000|1.85|
> |100000|1.56|
> |1000000|1.31|
> So unfortunately, often twice as slow.
> See followup messages for tests using real docvalues as well.



