The links I provided cover some of the high-level material. What's not really discussed is how the shuffling happens inside the search engine.
There are two parts to the shuffling:

SORTING

The /export handler, which uses the SortingResponseWriter, handles the sorting of entire result sets. Here's a link to the implementation:

https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/response/SortingResponseWriter.java

Basically what it's doing is iterating over a DocSet (a bitset), collecting the top 30,000 results in a priority queue, and sending them out. Then it turns off the bits that were already sent, iterates over the DocSet again, collects the next 30,000, and sends them out. It starts sending documents within milliseconds and continues sending until all the documents are finished. (There's a toy sketch of this batching loop at the bottom of this message.)

It pulls all field data from DocValues caches, which can be configured to trade off performance for memory by specifying the DocValues format type.

In some tests I've seen export performance of up to 500,000 docs per second from a single node. But there are many variables that affect export performance, so it can be significantly less depending on how many sort fields and export fields there are, what field types they are, and what DocValues format is being used for the fields. For something like time series data, where most or all of the sort and export fields can be expressed as numeric fields, export performance could be very high.

PARTITIONING

Result sets are partitioned by the HashQParserPlugin, which builds Lucene filters that hash-partition search results by arbitrary fields in the documents. Once the filters are built, the partitioning is extremely fast. (The second sketch below shows the core predicate.)

https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/HashQParserPlugin.java

Streaming Expressions are sent to a worker collection to be executed in parallel. Each worker node sets the hash partitioning filter onto the Streaming Expression and then opens the stream. Under the covers the search engine will stream back a specific partition of the result set to each worker. (The third sketch below shows what this looks like from the SolrJ side.)

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Jun 15, 2015 at 5:00 PM, Joel Bernstein <[email protected]> wrote:

> Under the hood streaming expressions are compiled to Streaming API
> objects. Some documentation on the Streaming API can be found here:
>
> http://joelsolr.blogspot.com/2015/03/parallel-computing-with-solrcloud.html
> http://joelsolr.blogspot.com/2015/04/the-streaming-api-solrjio-basics.html
> http://joelsolr.blogspot.com/2015/04/solrjio-using-decorators.html
> http://joelsolr.blogspot.com/2015/04/solrjio-computing-complement-of-two.html
> http://joelsolr.blogspot.com/2015/04/in-line-streaming-aggregation.html
> http://joelsolr.blogspot.com/2015/04/parallel-streaming-transformations.html
>
> It might also be useful to see how the new parallel SQL handler works.
> This compiles SQL statements to the Streaming API.
>
> https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/SQLHandler.java
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Jun 15, 2015 at 3:45 PM, Hrishikesh Gadre <[email protected]>
> wrote:
>
>> Hi,
>>
>> Do we have any doc describing how Solr streaming expressions are
>> implemented under the hood?
>>
>> This wiki page is more user centric:
>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>>
>> Regards
>> Hrishikesh
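P.S. A few toy sketches to make the mechanics above concrete.

First, the /export batching loop. This is not the SortingResponseWriter code, just a self-contained illustration of the pattern: iterate the bitset, keep the top N in a bounded priority queue, send them, clear their bits, repeat. java.util.BitSet stands in for the Lucene DocSet, an int array stands in for a DocValues sort field, and BATCH_SIZE is shrunk from 30,000 to 4 so the output is readable.

import java.util.BitSet;
import java.util.PriorityQueue;

public class ExportBatchSketch {

    static final int BATCH_SIZE = 4; // the real writer uses 30,000

    public static void main(String[] args) {
        // Stand-in for a DocValues sort field: sortValues[doc] is doc's sort key.
        int[] sortValues = {42, 7, 99, 3, 56, 18, 77, 5, 61, 23};

        // Stand-in for the DocSet: one bit per matching document.
        BitSet docs = new BitSet();
        docs.set(0, sortValues.length);

        while (!docs.isEmpty()) {
            // Bounded priority queue ordered worst-first (largest sort value at
            // the head), so the head is what gets evicted when a better doc shows up.
            PriorityQueue<Integer> queue = new PriorityQueue<>(
                (a, b) -> Integer.compare(sortValues[b], sortValues[a]));

            for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
                if (queue.size() < BATCH_SIZE) {
                    queue.offer(doc);
                } else if (sortValues[doc] < sortValues[queue.peek()]) {
                    queue.poll();
                    queue.offer(doc);
                }
            }

            // "Send" the batch, then turn off the bits that were already sent so
            // the next pass over the bitset collects the following batch. (The real
            // writer emits each batch in sort order; this toy doesn't bother.)
            for (int doc : queue) {
                System.out.println("doc=" + doc + " sortValue=" + sortValues[doc]);
                docs.clear(doc);
            }
        }
    }
}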
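Second, the hash partitioning predicate. Again a toy, not the HashQParserPlugin: the real plugin builds actual Lucene filters and uses its own hashing, so String.hashCode() here is just a stand-in. The point is that each worker keeps only the docs whose hashed partition key lands in its partition, the partitions don't overlap, and their union is the full result set.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HashPartitionSketch {

    // True if the worker with this id owns the doc with this partition-key value.
    static boolean ownedBy(String partitionKey, int workerId, int numWorkers) {
        int hash = partitionKey.hashCode() & Integer.MAX_VALUE; // force non-negative
        return hash % numWorkers == workerId;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("user1", "user2", "user3", "user4", "user5", "user6");
        int numWorkers = 3;

        // Each worker filters the same result set with its own partition predicate;
        // in Solr the predicate is baked into a Lucene filter once and then reused,
        // which is why the partitioning is so fast after the filters are built.
        for (int worker = 0; worker < numWorkers; worker++) {
            final int w = worker;
            List<String> partition = keys.stream()
                .filter(k -> ownedBy(k, w, numWorkers))
                .collect(Collectors.toList());
            System.out.println("worker " + w + " -> " + partition);
        }
    }
}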
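Finally, what the worker flow looks like from the client side with the SolrJ Streaming API (the solrj.io classes covered in the blog posts quoted above). This is a rough sketch rather than a copy of a working example: the constructor signatures have shifted between releases, and the zkHost, collection names, and field names are all made up. The important part is partitionKeys, which is what each worker turns into its hash partitioning filter.

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.ParallelStream;
import org.apache.solr.client.solrj.io.stream.TupleStream;

public class ParallelStreamSketch {

    public static void main(String[] args) throws Exception {
        Map<String, String> params = new HashMap<>();
        params.put("q", "*:*");
        params.put("fl", "id,partition_s,value_i");
        params.put("sort", "value_i asc");
        // partitionKeys tells each worker which field(s) to hash-partition on;
        // under the covers this becomes the HashQParserPlugin filter.
        params.put("partitionKeys", "partition_s");

        TupleStream search = new CloudSolrStream("localhost:9983", "collection1", params);

        // Ship the stream to 4 workers in the "workers" collection; each worker
        // opens its own partition and the results are merged back by value_i.
        ParallelStream parallel = new ParallelStream(
            "localhost:9983", "workers", search, 4,
            (t1, t2) -> Long.compare(t1.getLong("value_i"), t2.getLong("value_i")));

        parallel.open();
        try {
            while (true) {
                Tuple tuple = parallel.read();
                if (tuple.EOF) break;
                System.out.println(tuple.getString("id") + " " + tuple.getLong("value_i"));
            }
        } finally {
            parallel.close();
        }
    }
}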
