The links I provided cover some of the high-level material. What's not really
discussed is how the shuffling happens inside the search engine.

There are two parts to the shuffling:

SORTING

The /export handler, which uses the SortingResponseWriter, handles the
sorting of entire result sets. Here's a link to the implementation:

https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/response/SortingResponseWriter.java

Basically what it's doing is iterating over a DocSet (bitset), collecting
the top 30,000 results in a priority queue, and sending them out. Then it
turns off the bits that were already sent, iterates over the DocSet again,
and collects and sends the next 30,000. It starts sending documents in
milliseconds and continues sending until all the documents have been sent.
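
To make that concrete, here's a rough sketch of that batching loop in plain
Java. This is a simplified model using java.util.BitSet and PriorityQueue,
not the actual SortingResponseWriter internals; sortValues stands in for a
DocValues lookup and sendDoc is a placeholder for the response writer:

import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.PriorityQueue;

public class ExportSketch {

    static final int BATCH_SIZE = 30000;

    // docs is the result set as a bitset; sortValues[doc] stands in for a
    // DocValues lookup of the sort field (ascending sort assumed).
    static void export(BitSet docs, long[] sortValues) {
        while (!docs.isEmpty()) {
            // Max-heap on the sort value: when it grows past BATCH_SIZE we
            // evict the largest, so one full pass over the bitset leaves
            // the BATCH_SIZE smallest (i.e. top) docs in the queue.
            PriorityQueue<Integer> queue = new PriorityQueue<>(
                (a, b) -> Long.compare(sortValues[b], sortValues[a]));

            for (int doc = docs.nextSetBit(0); doc >= 0;
                     doc = docs.nextSetBit(doc + 1)) {
                queue.add(doc);
                if (queue.size() > BATCH_SIZE) {
                    queue.poll();
                }
            }

            // Drain the heap (largest first) back into sort order, send
            // the batch, and clear the bits that were sent so the next
            // pass collects the following BATCH_SIZE docs.
            Deque<Integer> batch = new ArrayDeque<>();
            while (!queue.isEmpty()) {
                batch.addFirst(queue.poll());
            }
            for (int doc : batch) {
                sendDoc(doc);  // placeholder for writing the doc out
                docs.clear(doc);
            }
        }
    }

    static void sendDoc(int doc) {
        System.out.println("doc " + doc);
    }
}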

It pulls all field data from DocValues caches, which can be configured to
trade off performance for memory by specifying the DocValues format type.

In some tests I've seen export performance of up to 500,000 docs per second
from a single node. But there are many variables that affect export
performance, so it can be significantly less depending on how many sort
fields and export fields there are, what the field types are, and which
DocValues format is being used for the fields.

For something like time series data, where most or all of the sort and
export fields can be expressed as numeric fields, export performance could
be very high.

PARTITIONING

Result sets are partitioned by the HashQParserPlugin, which builds Lucene
filters that hash-partition search results by arbitrary fields in the
documents. Once the filters are built, the partitioning is extremely fast.

https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/HashQParserPlugin.java
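
Conceptually the filter is just a hash-mod predicate over the partition
key. Here's a minimal sketch of that idea in plain Java; it's not the
actual HashQParserPlugin code, which hashes the configured partition fields
via DocValues and wraps the predicate in a Lucene filter rather than
calling String.hashCode:

public class HashPartitionSketch {

    // A doc belongs to partition `worker` of `workers` when the hash of
    // its partition key mods to that slot.
    static boolean inPartition(String partitionKey, int worker, int workers) {
        int hash = partitionKey.hashCode();
        // Mask the sign bit so the modulo is never negative.
        return (hash & Integer.MAX_VALUE) % workers == worker;
    }

    public static void main(String[] args) {
        // Every doc with the same key lands in the same partition.
        for (String key : new String[] {"cust_a", "cust_b", "cust_c"}) {
            for (int w = 0; w < 4; w++) {
                if (inPartition(key, w, 4)) {
                    System.out.println(key + " -> partition " + w);
                }
            }
        }
    }
}

Because a given key always hashes to the same partition, all documents that
share a partition key land on the same worker, which is what lets
operations like joins and rollups run correctly in parallel.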

Streaming Expressions are sent to a worker collection to be executed in
parallel. Each worker node sets the hash partitioning filter on the
Streaming Expression and then opens the stream. Under the covers, the
search engine streams back a specific partition of the result set to each
worker.
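
For illustration, here's roughly what each worker's request looks like
under this scheme. This is a hypothetical sketch: the {!hash ...} filter
and the partitionKeys parameter follow my reading of the HashQParserPlugin
source linked above, and the exact parameter names may differ between
versions:

import java.util.HashMap;
import java.util.Map;

public class WorkerRequestSketch {

    // Hypothetical sketch of the request each of N workers sends under
    // the covers; parameter names follow the HashQParserPlugin source
    // linked above and may vary by version.
    static Map<String, String> workerParams(int worker, int workers) {
        Map<String, String> params = new HashMap<>();
        params.put("q", "*:*");
        params.put("sort", "customer_id asc");       // example sort
        params.put("fq", "{!hash workers=" + workers
                         + " worker=" + worker + "}");
        params.put("partitionKeys", "customer_id");  // hash field(s)
        return params;
    }

    public static void main(String[] args) {
        int workers = 4;
        for (int worker = 0; worker < workers; worker++) {
            // Each worker opens its own stream with its own filter, so
            // the engine streams back a disjoint slice of the result set.
            System.out.println(workerParams(worker, workers));
        }
    }
}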

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Jun 15, 2015 at 5:00 PM, Joel Bernstein <[email protected]> wrote:

> Under the hood streaming expressions are compiled to Streaming API
> objects. Some documentation on the Streaming API can be found here:
>
> http://joelsolr.blogspot.com/2015/03/parallel-computing-with-solrcloud.html
> http://joelsolr.blogspot.com/2015/04/the-streaming-api-solrjio-basics.html
> http://joelsolr.blogspot.com/2015/04/solrjio-using-decorators.html
>
> http://joelsolr.blogspot.com/2015/04/solrjio-computing-complement-of-two.html
> http://joelsolr.blogspot.com/2015/04/in-line-streaming-aggregation.html
>
> http://joelsolr.blogspot.com/2015/04/parallel-streaming-transformations.html
>
> It might also be useful to see how the new parallel SQL handler works.
> This compiles SQL statements to the Streaming API.
>
>
> https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/SQLHandler.java
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Jun 15, 2015 at 3:45 PM, Hrishikesh Gadre <[email protected]>
> wrote:
>
>> Hi,
>>
>> Do we have any doc describing how Solr streaming expressions are
>> implemented under the hood?
>>
>> This wiki page is more user centric,
>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>>
>> Regards
>> Hrishikesh
>>
>
>
