[
https://issues.apache.org/jira/browse/SOLR-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859868#comment-15859868
]
Joel Bernstein commented on SOLR-9978:
--------------------------------------
You'll want to be sure to test with high-cardinality collapse fields,
something like 1,000,000 unique groups. This is the use case that collapse was
really designed for. Low-cardinality use cases are probably better suited to
grouping.
> Reduce collapse query memory usage
> ----------------------------------
>
> Key: SOLR-9978
> URL: https://issues.apache.org/jira/browse/SOLR-9978
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Attachments: SOLR-9978.patch, SOLR-9978.patch
>
>
> - Single shard test with one replica
> - 10M documents and 9M of those documents are unique. Test was for string
> - Collapse query parser creates two arrays :
> - int array for unique documents ( 9M in this case )
> - float array for the corresponding scores ( 9M in this case )
> - It goes through all documents and puts the document in the array if the
> score is better than the previously existing score.
> - So collapse creates a lot of garbage when the total number of documents is
> high and there are very few duplicates
> - Even for a query like this {{q={!cache=false}*:*&fq={!collapse
> field=collapseField_s cache=false}&sort=id desc}}
> which has a top-level sort, the collapse query parser still creates the score
> array and scores every document
> Indexing script used to generate dummy data:
> {code}
> // Index 10M documents; every 10th document (except i=0) reuses the previous
> // document's collapse value, so roughly 1 in 10 documents is a duplicate.
> List<SolrInputDocument> docs = new ArrayList<>(1000);
> for (int i = 0; i < 1000 * 1000 * 10; i++) {
>   SolrInputDocument doc = new SolrInputDocument();
>   doc.addField("id", i);
>   if (i % 10 == 0 && i != 0) {
>     doc.addField("collapseField_s", i - 1);
>   } else {
>     doc.addField("collapseField_s", i);
>   }
>   docs.add(doc);
>   // Send documents to the "ct" collection in batches of 1000.
>   if (docs.size() == 1000) {
>     client.add("ct", docs);
>     docs.clear();
>   }
> }
> client.commit("ct");
> {code}
> Query:
> {{q=\{!cache=false\}*:*&fq=\{!collapse field=collapseField_s
> cache=false\}&sort=id desc}}
> Improvements
> - We currently default to the SCORE implementation if no min|max|sort param
> is provided in the collapse query. Check whether a global sort is provided and,
> if so, skip scoring and pick the first occurrence of each unique value.
> - Instead of creating an array for unique documents use a bitset
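
To make the memory trade-off in the description concrete, here is a minimal Java sketch contrasting the two approaches. It is not Solr's actual collapse code: the class name `CollapseSketch`, both method names, and the `groupOrds` input (one group ordinal per document, standing in for the collapse field's values) are hypothetical and chosen only for illustration. With 9M unique values, an int array plus a float array costs about 2 x 9,000,000 x 4 bytes, roughly 72 MB, while a bitset over 10M documents needs about 1.25 MB, ignoring object overhead.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative sketch only; names and inputs are assumptions, not Solr APIs.
final class CollapseSketch {

    // Score-based collapse: one int and one float slot per group.
    // Returns, for each group ordinal, the doc id with the highest score
    // (or -1 if the group has no documents).
    static int[] collapseByScore(int[] groupOrds, float[] scores, int numGroups) {
        int[] bestDoc = new int[numGroups];
        float[] bestScore = new float[numGroups];
        Arrays.fill(bestDoc, -1);
        for (int doc = 0; doc < groupOrds.length; doc++) {
            int ord = groupOrds[doc];
            // Replace the group head if this is the first doc seen or it scores higher.
            if (bestDoc[ord] == -1 || scores[doc] > bestScore[ord]) {
                bestDoc[ord] = doc;
                bestScore[ord] = scores[doc];
            }
        }
        return bestDoc;
    }

    // First-occurrence collapse: no scoring at all. One bit per group records
    // whether the group was already seen, and one bit per document marks the
    // documents that survive the collapse.
    static BitSet collapseFirstOccurrence(int[] groupOrds, int numGroups) {
        BitSet seenGroups = new BitSet(numGroups);
        BitSet collapsedDocs = new BitSet(groupOrds.length);
        for (int doc = 0; doc < groupOrds.length; doc++) {
            int ord = groupOrds[doc];
            if (!seenGroups.get(ord)) {
                seenGroups.set(ord);
                collapsedDocs.set(doc);
            }
        }
        return collapsedDocs;
    }
}
```

The first-occurrence variant only makes sense when the final ordering comes from a top-level sort rather than from score, which is the situation described in the first improvement above. Whether a second bitset for seen groups is how the patch tracks uniqueness is an assumption of this sketch.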