[
https://issues.apache.org/jira/browse/SOLR-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853248#comment-15853248
]
Joel Bernstein edited comment on SOLR-9599 at 2/5/17 3:31 PM:
--------------------------------------------------------------
I've started performance testing the new iterator API as part of the Apache
Calcite integration. In particular I've been testing the ExportWriter's sort
and export performance with String fields. So far the performance numbers have
been comparable to the random access API.
The types of Streaming Expressions I've been running look like this:
{code}
null(search(enron, q="*:*", fl="to", sort="to desc", qt="/export"))
{code}
This will export and sort all the values in the "to" field in the enron email
data set. The *null* function simply drops all the tuples, so we fully isolate
the performance of the /export handler.
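For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a client-side timing harness. It assumes the expression is sent to Solr's /stream handler through the `expr` parameter; the base URL, the number of runs, and the helper names (`export_expr`, `stream_url`, `time_expr`) are illustrative assumptions, not part of the test setup described in this comment.

```python
import time
import urllib.parse
import urllib.request


def export_expr(collection, field, direction="desc"):
    """Build the null(search(...)) expression shown above for one field."""
    return (
        f'null(search({collection}, q="*:*", fl="{field}", '
        f'sort="{field} {direction}", qt="/export"))'
    )


def stream_url(base_url, collection, expr):
    """URL for the /stream handler, with the expression URL-encoded."""
    query = urllib.parse.urlencode({"expr": expr})
    return f"{base_url.rstrip('/')}/{collection}/stream?{query}"


def time_expr(base_url, collection, expr, runs=5):
    """Wall-clock seconds per run; the response is read fully each time."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        with urllib.request.urlopen(stream_url(base_url, collection, expr)) as resp:
            resp.read()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Assumed local Solr instance; adjust the base URL for a real cluster.
    expr = export_expr("enron", "to")
    print(time_expr("http://localhost:8983/solr", "enron", expr))
```

Timing on the client includes HTTP overhead, so it is best used to compare two builds against the same index rather than as an absolute number.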
I've been using the Direct docValues format for the test, but I'll reindex
with the default docValues format this week. Typically, though, my advice to
anyone who wants to maximize streaming performance is to use Direct docValues.
Currently both the old and new docValues APIs perform this operation in 400 ms.
I'll continue to update this thread as I test numerics and different docValues
formats, and also increase the size of indexes.
I'll also be testing the performance of field collapsing.
> DocValues performance regression with new iterator API
> ------------------------------------------------------
>
> Key: SOLR-9599
> URL: https://issues.apache.org/jira/browse/SOLR-9599
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: master (7.0)
> Reporter: Yonik Seeley
> Fix For: master (7.0)
>
>
> I did a quick performance comparison of faceting indexed fields (i.e.
> docvalues are not stored) using method=dv before and after the new docvalues
> iterator went in (LUCENE-7407).
> 5M document index, 21 segments, single valued string fields w/ no missing
> values.
> || field cardinality || new_time / old_time ||
> |10|2.01|
> |1000|2.02|
> |10000|1.85|
> |100000|1.56|
> |1000000|1.31|
> So unfortunately, often twice as slow.
> See followup messages for tests using real docvalues as well.
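As a reading aid for the table in the issue description, the sketch below converts each new_time / old_time ratio into a percentage slowdown and a relative throughput. The ratios are copied from the table; the helper names are illustrative assumptions.

```python
# new_time / old_time ratios from the issue description, keyed by field cardinality.
ratios = {10: 2.01, 1000: 2.02, 10000: 1.85, 100000: 1.56, 1000000: 1.31}


def percent_slower(ratio):
    """Extra time taken by the new API, as a percentage of the old time."""
    return (ratio - 1.0) * 100.0


def relative_throughput(ratio):
    """Throughput of the new API as a fraction of the old throughput."""
    return 1.0 / ratio


for cardinality, ratio in ratios.items():
    print(
        f"{cardinality:>8}: {percent_slower(ratio):5.1f}% slower, "
        f"{relative_throughput(ratio):.2f}x the old throughput"
    )
```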