Horatiu Lazu commented on SOLR-12216:

I'm still doing some finishing touches, hoping to get it out soon, but would 
like some feedback at this point on the idea itself. 

> Add support for cross-cloud join 
> ---------------------------------
>                 Key: SOLR-12216
>                 URL: https://issues.apache.org/jira/browse/SOLR-12216
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search
>            Reporter: Horatiu Lazu
>            Priority: Major
> This patch proposes extending the capabilities of the 
> built-in join to allow joining across SolrClouds. Similar to streaming's 
> search function, the user can directly specify the zkHost of the other 
> SolrCloud, and the rest of the syntax (from, to, fromIndex) can remain the 
> same. This join would be triggered when the zkHost parameter is specified, 
> containing the ZooKeeper address of the other SolrCloud. It could also be 
> packaged as a separate plugin.
> In my testing, my current implementation is on average 4.5x faster than an 
> equivalent streaming expression intersecting from two search queries, one of 
> which streams from another collection on another SolrCloud. 
> h5. How it works
> Similar to the existing join, I created a QParser, but this join works as a 
> post-filter. The join first populates a hash set with the join field values 
> from the “from” index (i.e., the index that is not the one we’re running the 
> query from). To obtain the values, it connects to the other SolrCloud using 
> SolrJ through the ZooKeeper address specified, and then uses a custom request 
> handler that performs the query on the “from” index and returns an array of 
> strings containing those values. Then, on the “to” index, it iterates through 
> the array (sent as JavaBin) and adds each element to the hash set. After 
> that, we iterate through the NumericDocList for the “to” core’s join field, 
> and if a value within the NumericDocList is found in our hash set, we collect 
> the document inside the DelegatingCollector.
> This allows for joining across sharded collections as well. 
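
To make the two phases above concrete, here is a minimal plain-Java sketch of the hash-set join idea. It is not the patch code: the class name CrossCloudJoinSketch and the method join are made up for illustration, the remote SolrJ call is replaced by an in-memory array, and a list of doc ids stands in for the DelegatingCollector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; no Solr classes are used here.
public class CrossCloudJoinSketch {

    /**
     * Returns the ids (array positions) of "to"-side documents whose join
     * value appears among the "from"-side values.
     *
     * @param fromValues   join field values returned by the "from" index
     *                     (assumed to arrive as an array of strings)
     * @param toJoinValues join field value per "to"-side doc id; null means
     *                     the document has no value for the field
     */
    public static List<Integer> join(String[] fromValues, String[] toJoinValues) {
        // Phase 1: populate the hash set from the "from" side (duplicates collapse).
        Set<String> fromSet = new HashSet<>(Arrays.asList(fromValues));

        // Phase 2: post-filter the "to" side, collecting only matching docs.
        List<Integer> collected = new ArrayList<>();
        for (int docId = 0; docId < toJoinValues.length; docId++) {
            String value = toJoinValues[docId];
            if (value != null && fromSet.contains(value)) {
                collected.add(docId);
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        String[] fromValues = {"a", "c", "c", "x"};
        String[] toJoinValues = {"a", "b", null, "c"};
        System.out.println(join(fromValues, toJoinValues)); // prints [0, 3]
    }
}
```

The lookup per "to"-side document is a constant-time set membership check, so the cost of the filter phase grows with the number of candidate documents rather than with the size of the "from" result.
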
> h5. How I benchmarked
> I created a web app that first reloads the collections, then sends 25 AJAX 
> requests of varying query sizes (between 127 and 690,000 search results) to 
> the Solr endpoint at once, and records the results. After all 
> responses are returned, the collections are reloaded, and the equivalent 
> streaming expressions are tested. This process is repeated 15 times, and the 
> average of the results is taken. 
> Note: The first two requests are not counted in the statistics, because they 
> “warm up” the collection. For reference, after bouncing Solr and executing at 
> least one query, it takes on average ~890ms to join two 
> collections with about 690,000 results, while it takes ~4.5 seconds using 
> streaming expressions.
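
For clarity on how the warm-up requests are excluded, the averaging step can be sketched roughly as below. The class name WarmupAverage, the method averageAfterWarmup, and the latency numbers are placeholders for illustration, not measurements from the benchmark.

```java
import java.util.Arrays;

// Hypothetical helper; the timings in main are placeholders, not benchmark data.
public class WarmupAverage {

    /** Mean latency in ms after skipping the first warmupCount samples. */
    public static double averageAfterWarmup(long[] latenciesMs, int warmupCount) {
        if (warmupCount < 0 || latenciesMs.length <= warmupCount) {
            throw new IllegalArgumentException("need more samples than warm-up requests");
        }
        return Arrays.stream(latenciesMs)
                .skip(warmupCount)   // drop the requests that warm up the collection
                .average()
                .getAsDouble();      // safe: at least one sample remains
    }

    public static void main(String[] args) {
        long[] run = {1200, 1000, 100, 200, 300}; // first two are warm-up requests
        System.out.println(averageAfterWarmup(run, 2)); // prints 200.0
    }
}
```
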
> I have written unit tests as well. I would appreciate some comments 
> on this. Thank you.
