[jira] [Updated] (SOLR-12178) Improve efficiency of distributed random sampling

Cassandra Targett (Jira) Fri, 13 Aug 2021 14:07:04 -0700


     [ https://issues.apache.org/jira/browse/SOLR-12178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Cassandra Targett updated SOLR-12178:
-------------------------------------
    Component/s: streaming expressions

> Improve efficiency of distributed random sampling
> -------------------------------------------------
>
>                 Key: SOLR-12178
>                 URL: https://issues.apache.org/jira/browse/SOLR-12178
>             Project: Solr
>          Issue Type: Improvement
>          Components: streaming expressions
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>            Priority: Major
>             Fix For: 8.1, main (9.0)
>
>
> Currently the *random* Streaming Expression performs distributed random 
> sampling using *CloudSolrClient*. A random sample of *N* docs from each shard 
> is read into memory on the aggregator node, and a page of *N* docs is then 
> built from those per-shard samples. Because every shard's sample is held in 
> memory on the aggregator node, memory consumption for random sampling grows 
> as N * numshards, which limits both N and numshards.
> This ticket will change random sampling to an approach similar to the one 
> used in *CloudSolrStream*, where a stream is generated from the shards 
> without reading all the documents into memory.
> When combined with SOLR-12159, this will allow for much larger random samples. 
>  
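
The difference between the two approaches described above can be sketched in a few lines of Python. This is only an illustration, not Solr's implementation: the ticket does not say how the streamed sample is built, so the random-key merge used here is an assumption, and the names shard_sample, buffered_sample and streamed_sample are hypothetical.

```python
import heapq
import random
from itertools import islice


def shard_sample(docs, n, rng):
    """Hypothetical shard-side step: tag each doc with a random key and
    return the n docs with the smallest keys, in ascending key order.
    The returned iterator stands in for a shard's response stream."""
    keyed = ((rng.random(), doc) for doc in docs)
    return iter(heapq.nsmallest(n, keyed))


def buffered_sample(shards, n, rng):
    """Buffered approach: collect every shard's sample on the aggregator
    first (n * len(shards) docs), then cut a single page of n docs."""
    buffered = []
    for docs in shards:
        buffered.extend(shard_sample(docs, n, rng))
    buffered.sort()
    return [doc for _, doc in buffered[:n]], len(buffered)


def streamed_sample(shards, n, rng):
    """Streaming approach (assumed): lazily merge the key-ordered shard
    streams and stop after n docs. heapq.merge keeps only one pending
    item per input stream, so the aggregator never buffers all samples."""
    streams = [shard_sample(docs, n, rng) for docs in shards]
    merged = heapq.merge(*streams)
    return [doc for _, doc in islice(merged, n)]


if __name__ == "__main__":
    shards = [[f"shard{i}-doc{j}" for j in range(50)] for i in range(4)]
    page, held = buffered_sample(shards, 10, random.Random(1))
    print("docs buffered on aggregator:", held)
    print("streamed page size:", len(streamed_sample(shards, 10, random.Random(1))))
```

In this sketch the buffered variant holds N * numshards documents before it produces a page, while the merged variant pulls documents from the shard streams only as they are needed.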



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
