[
https://issues.apache.org/jira/browse/SOLR-12178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cassandra Targett updated SOLR-12178:
-------------------------------------
Component/s: streaming expressions
> Improve efficiency of distributed random sampling
> -------------------------------------------------
>
> Key: SOLR-12178
> URL: https://issues.apache.org/jira/browse/SOLR-12178
> Project: Solr
> Issue Type: Improvement
> Components: streaming expressions
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
> Priority: Major
> Fix For: 8.1, main (9.0)
>
>
> Currently the *random* Streaming Expression performs distributed random
> sampling using *CloudSolrClient*. A random sample of *N* docs from each
> shard is read into memory on the aggregator node, and a page of *N* docs
> is then created from those per-shard samples. Because every shard's sample
> is held in memory on the aggregator node, memory consumption for random
> sampling grows as N*numshards, which limits both N and numshards.
> This ticket will change random sampling to an approach similar to the one
> used in *CloudSolrStream*, where a stream is generated from the shards
> without reading all the documents into memory.
> When combined with SOLR-12159, this will allow for much larger random samples.
>
>
>
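The memory argument in the description above can be illustrated with a small Python sketch. This is not Solr's implementation: `shard_stream` and `streaming_sample` are hypothetical names, and the sketch assumes each shard can emit its docs in ascending order of an independent random key. Under that assumption, a k-way merge on the aggregator needs to hold only one pending doc per shard, rather than N docs per shard.

```python
import heapq
import random
from itertools import islice


def shard_stream(docs, n, rng):
    """Hypothetical shard side: yield up to n (key, doc) pairs in ascending key order.

    Each doc gets an independent uniform random key, so the n smallest keys
    form a uniform random sample of this shard.
    """
    keyed = [(rng.random(), doc) for doc in docs]
    # nsmallest returns its result sorted by key, smallest first.
    for pair in heapq.nsmallest(n, keyed, key=lambda p: p[0]):
        yield pair


def streaming_sample(shards, n, seed=None):
    """Hypothetical aggregator side: lazily merge the shard streams, emit n docs.

    heapq.merge keeps one pending item per input stream, so aggregator memory
    grows with the number of shards, not with n * number of shards.
    """
    rng = random.Random(seed)
    streams = [shard_stream(docs, n, rng) for docs in shards]
    merged = heapq.merge(*streams, key=lambda p: p[0])
    for _key, doc in islice(merged, n):
        yield doc


if __name__ == "__main__":
    shards = [[f"shard{i}-doc{j}" for j in range(1000)] for i in range(4)]
    for doc in streaming_sample(shards, 5, seed=42):
        print(doc)
```

Because the keys are independent and uniform across all docs, the n smallest keys overall form a uniform sample without replacement from the union of the shards, and no shard ever needs to contribute more than n docs.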
--
This message was sent by Atlassian Jira
(v8.3.4#803005)