[ 
https://issues.apache.org/jira/browse/SOLR-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Horatiu Lazu updated SOLR-12216:
--------------------------------
    Description: 
This patch proposes extending the built-in join so that it can join across 
SolrClouds. Similar to streaming's search function, the user can directly 
specify the zkHost of the other SolrCloud, and the rest of the syntax (from, 
to, fromIndex) can remain the same. This join would be triggered when the 
zkHost parameter, containing the address of the other SolrCloud cluster, is 
specified. It could also be packaged as a separate plugin.
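
As a rough illustration of the proposed syntax, the sketch below assembles a 
join query string that carries a zkHost local parameter next to the usual 
from, to and fromIndex. The field names, collection name and ZooKeeper address 
are placeholders, and the exact form the patch accepts is an assumption here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: builds a {!join ...} query string with a zkHost local param. */
final class CrossCloudJoinQuery {

    // Renders "{!join k1=v1 k2=v2 ...}innerQuery", keeping the map's order.
    static String build(Map<String, String> localParams, String innerQuery) {
        StringBuilder sb = new StringBuilder("{!join");
        for (Map.Entry<String, String> e : localParams.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.append('}').append(innerQuery).toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("from", "manu_id_s");        // placeholder field name
        params.put("to", "id");                 // placeholder field name
        params.put("fromIndex", "products");    // placeholder collection name
        params.put("zkHost", "zk-remote:2181"); // placeholder ZooKeeper address
        // prints {!join from=manu_id_s to=id fromIndex=products zkHost=zk-remote:2181}category:books
        System.out.println(build(params, "category:books"));
    }
}
```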

 

In my testing, my current implementation is on average 4.5x faster than an 
equivalent streaming expression that intersects two search queries, one of 
which streams from another collection on another SolrCloud. 
h5. How it works

Similar to the existing join, I created a QParser, but this join works as a 
post-filter. The join first populates a hash set with join-field values from 
the “from” index (i.e., the index that is not the one the query runs against). 
To obtain the values, it connects to the other SolrCloud using SolrJ through 
the specified ZooKeeper address, then calls a custom request handler that runs 
the query on the “from” index and returns an array of strings containing the 
field values. Then, on the “to” index, it iterates through the array, which is 
sent as JavaBin, and adds each value to the hash set. After that, we iterate 
through the NumericDocList for the “to” core’s join field, and if a value in 
the NumericDocList is found in our hash set, we collect the document inside 
the DelegatingCollector.

This allows for joining across sharded collections as well. 
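
The hash-set step described above can be sketched without any Solr classes. 
This is a simplified stand-in, not the code from the patch: the remote values 
are passed in as a plain list instead of being fetched through SolrJ, the “to” 
side is a plain array indexed by document id, and the returned list plays the 
role of the DelegatingCollector. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: hash-set membership join over in-memory data. */
final class HashSetJoinSketch {

    // Returns the ids of local ("to" side) documents whose join-field value
    // also occurs among the values returned by the remote ("from" side) query.
    static List<Integer> join(List<String> remoteValues, String[] localJoinField) {
        // Build the lookup set once; duplicates in the remote result collapse.
        Set<String> fromValues = new HashSet<>(remoteValues);
        List<Integer> collected = new ArrayList<>();
        for (int docId = 0; docId < localJoinField.length; docId++) {
            String value = localJoinField[docId];
            // Documents without a join-field value can never match.
            if (value != null && fromValues.contains(value)) {
                collected.add(docId);
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        List<String> remote = Arrays.asList("a", "c", "c");
        String[] local = {"a", "b", null, "c"};
        System.out.println(join(remote, local)); // prints [0, 3]
    }
}
```

Building the set costs one pass over the remote result, and each local 
document is then checked with a single expected constant-time lookup.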
h5. How I benchmarked

I created a web app that first reloads the collections, then sends 25 AJAX 
requests at once to the Solr endpoint with varying query sizes (between 127 
and 690,000 search results), and then records the results. After all responses 
are returned, the collection is reloaded, and the equivalent streaming 
expressions are tested. This process is repeated 15 times, and the results are 
averaged. 

Note: The first two requests are not counted in the statistics, because they 
“warm up” the collection. For reference, after bouncing Solr and executing at 
least one query, joining two collections with about 690,000 results takes 
~890 ms on average, while the equivalent streaming expression takes ~4.5 
seconds.
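
The averaging with warm-up requests excluded can be sketched as follows. The 
class and method names are hypothetical, and the timings in main are 
placeholder numbers chosen for easy arithmetic, not measurements from the 
benchmark.

```java
/** Sketch only: mean response time with leading warm-up requests dropped. */
final class WarmupAverage {

    // Averages timingsMs after skipping the first `warmup` entries.
    static double averageAfterWarmup(long[] timingsMs, int warmup) {
        if (warmup < 0 || timingsMs.length <= warmup) {
            throw new IllegalArgumentException("need more timings than warm-up requests");
        }
        long sum = 0;
        for (int i = warmup; i < timingsMs.length; i++) {
            sum += timingsMs[i];
        }
        return (double) sum / (timingsMs.length - warmup);
    }

    public static void main(String[] args) {
        // Placeholder values; the first two entries stand in for warm-up requests.
        long[] timingsMs = {500, 400, 100, 200, 300};
        System.out.println(averageAfterWarmup(timingsMs, 2)); // prints 200.0
    }
}
```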

 

I have written unit tests as well. I would appreciate some comments on this. 
Thank you.



> Add support for cross-cloud join 
> ---------------------------------
>
>                 Key: SOLR-12216
>                 URL: https://issues.apache.org/jira/browse/SOLR-12216
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search
>            Reporter: Horatiu Lazu
>            Priority: Trivial



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
