[ 
https://issues.apache.org/jira/browse/SOLR-13494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-13494:
----------------------------------
    Description: 
Currently the *random* Streaming Expression performs a conventional distributed 
search. This involves retrieving the top N docs from each shard and then 
selecting the top N from all the shards in the aggregator node. This technique 
eventually bogs down as the number of shards goes up and/or N goes up. 
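
To make that cost concrete, here is a rough Python model of the conventional approach (not Solr code). Shards are plain lists of document ids, and the helper name `conventional_random_sample` and the per-document random sort key are assumptions made only for illustration.

```python
import heapq
import random


def conventional_random_sample(shards, n, seed=0):
    """Model a conventional distributed top-N search with a random sort.

    Hypothetical helper for illustration only. `shards` is a list of lists
    of document ids. Returns (sample, docs_transferred).
    """
    rng = random.Random(seed)
    candidates = []
    for docs in shards:
        # Each shard assigns a random sort key and returns its own top n.
        keyed = [(rng.random(), doc) for doc in docs]
        candidates.extend(heapq.nlargest(n, keyed))
    # The aggregator receives up to n * numShards candidates and keeps n.
    sample = [doc for _, doc in heapq.nlargest(n, candidates)]
    return sample, len(candidates)


shards = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
sample, transferred = conventional_random_sample(shards, n=100)
print(len(sample), transferred)  # 100 400
```

In this model the aggregator handles n * numShards candidates to produce n results, so the work grows with both values.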

Selecting distributed random samples does not actually require this behavior. 
Instead you can select N/numShards from each shard and simply return all 
results. This technique will actually get faster as more shards are added 
instead of slowing down.
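
A minimal Python sketch of this idea, under the same toy model of shards as plain lists. The names `shard_quotas` and `deep_random_sample` are hypothetical, and the way the remainder of N/numShards is spread across shards is an assumption; the ticket does not say how DeepRandomStream handles it.

```python
import random


def shard_quotas(n, num_shards):
    """Split n into per-shard sample sizes that sum to exactly n."""
    base, extra = divmod(n, num_shards)
    # Assumption: the first `extra` shards take one more doc each
    # to absorb the remainder.
    return [base + 1 if i < extra else base for i in range(num_shards)]


def deep_random_sample(shards, n, seed=0):
    """Sample roughly n/numShards docs per shard and return them all."""
    rng = random.Random(seed)
    sample = []
    for docs, quota in zip(shards, shard_quotas(n, len(shards))):
        # A shard cannot return more docs than it holds.
        sample.extend(rng.sample(docs, min(quota, len(docs))))
    return sample


shards = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
quotas = shard_quotas(10, 4)
deep_sample = deep_random_sample(shards, n=100)
print(quotas)            # [3, 3, 2, 2]
print(len(deep_sample))  # 100
```

Each shard returns only its quota, so nothing is merged or discarded at the aggregator. Note that equal quotas give a uniform sample only when the shards hold similar numbers of matching documents.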

This ticket will allow the random Streaming Expression to use the strategy 
above when N reaches a certain threshold (e.g. 10,000).

The *DeepRandomStream* class will implement the deep random sampling behavior.

The random Streaming Expression will switch between the RandomStream and 
DeepRandomStream depending on N.
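
The switch could look roughly like the Python sketch below. The function name `choose_stream` is hypothetical, the threshold of 10,000 is only the example value mentioned above, and treating "reaches" as greater-than-or-equal is an assumption.

```python
# Assumed value, taken from the example threshold in the ticket text.
DEEP_SAMPLING_THRESHOLD = 10_000


def choose_stream(n, threshold=DEEP_SAMPLING_THRESHOLD):
    """Return the name of the stream class to use for a sample of size n."""
    return "DeepRandomStream" if n >= threshold else "RandomStream"


print(choose_stream(500))      # RandomStream
print(choose_stream(250_000))  # DeepRandomStream
```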

*Performance*

Local testing shows astounding performance on random sampling with the new 
technique. 

Selecting a random sample of *250,000* documents with two numeric fields and 
running a regression analysis on the sample set takes *under a second*. 
A screenshot of the math expression code is attached.

  was:
Currently the *random* Streaming Expression performs a conventional distributed 
search. This involves retrieving the top N docs from each shard and then 
selecting the top N from all the shards in the aggregator node. This technique 
eventually bogs down as the number of shards goes up and/or N goes up. 

Selecting distributed random samples does not actually require this behavior. 
Instead you can select N/numShards from each shard and simply return all 
results. This technique will actually get faster as more shards are added 
instead of slowing down.

This ticket will allow the random Streaming Expression to use the strategy 
above when N reaches a certain threshold (ie 10000). 

The *DeepRandomStream* class will implement the deep random sampling behavior.

The random Streaming Expression will switch between the RandomStream and 
DeepRandomStream depending on N.

Local testing shows astounding performance on random sampling with the new 
technique. 

Selecting a random sample of 250,000 documents with two numeric fields and 
running a regression analysis on the sample set takes under a second. Attached 
is a screen shot with the math expression code.


> Add DeepRandomStream implementation
> -----------------------------------
>
>                 Key: SOLR-13494
>                 URL: https://issues.apache.org/jira/browse/SOLR-13494
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: streaming expressions
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>            Priority: Major
>         Attachments: SOLR-13494.patch, SOLR-13494.patch, Screen Shot 
> 2019-05-28 at 4.50.54 PM.png
>
>
> Currently the *random* Streaming Expression performs a conventional 
> distributed search. This involves retrieving the top N docs from each shard 
> and then selecting the top N from all the shards in the aggregator node. This 
> technique eventually bogs down as the number of shards goes up and/or N goes 
> up. 
> Selecting distributed random samples does not actually require this behavior. 
> Instead you can select N/numShards from each shard and simply return all 
> results. This technique will actually get faster as more shards are added 
> instead of slowing down.
> This ticket will allow the random Streaming Expression to use the strategy 
> above when N reaches a certain threshold (e.g. 10,000).
> The *DeepRandomStream* class will implement the deep random sampling behavior.
> The random Streaming Expression will switch between the RandomStream and 
> DeepRandomStream depending on N.
> *Performance*
> Local testing shows astounding performance on random sampling with the new 
> technique. 
> Selecting a random sample of *250,000* documents with two numeric fields and 
> running a regression analysis on the sample set takes *under a second*. 
> A screenshot of the math expression code is attached.


