[ https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296197#comment-15296197 ]
Varun Thacker commented on SOLR-9142:
-------------------------------------
Attaching the code snippet used to create the data set:
{code:title=TestJSONFacetAPI.java|borderStyle=solid}
import org.apache.lucene.util.TestUtil;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class TestJSONFacetAPI {

  public static void main(String[] args) throws IOException, SolrServerException {
    Random r = new Random();
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");
    // Start from an empty collection.
    client.deleteByQuery("techproducts", "*:*");

    List<SolrInputDocument> docs = new ArrayList<>(10000);
    for (int i = 0; i < 2000000; i++) {
      SolrInputDocument document = new SolrInputDocument();
      document.addField("id", i);
      // Top-level facet field: 1000 distinct values.
      document.addField("top_facet_s", i % 1000);
      // High-cardinality string field: two random strings per document.
      document.addField("sub_facet_unique_s",
          TestUtil.randomSimpleString(r, 3, 10) + " " + TestUtil.randomSimpleString(r, 3, 10));
      // High-cardinality numeric field: one distinct value per document.
      document.addField("sub_facet_unique_td", i);
      // Low-cardinality fields: 5 distinct values each.
      document.addField("sub_facet_limited_s", i % 5);
      document.addField("sub_facet_limited_td", i % 5);
      docs.add(document);

      // Send and commit the pending batch whenever i is a multiple of 10000.
      if (i % 10000 == 0) {
        client.add("techproducts", docs);
        client.commit("techproducts");
        docs.clear();
        System.out.println(i);
      }
    }
    // Send the documents left over after the last full batch.
    client.add("techproducts", docs);
    client.commit("techproducts");
  }
}
{code}
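As a quick sanity check on the shape of this data set, the following stand-alone sketch recomputes how the documents spread over the {{top_facet_s}} values. It does not touch Solr; the class and method names are hypothetical and only the `i % 1000` rule and the 2M document count come from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr or of the original snippet.
final class DataSetShapeSketch {

    // Counts documents per top-level value when document i gets the value i % numTopValues.
    static Map<Integer, Integer> docsPerTopValue(int numDocs, int numTopValues) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < numDocs; i++) {
            counts.merge(i % numTopValues, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = docsPerTopValue(2_000_000, 1000);
        System.out.println(counts.size() + " top-level buckets"); // 1000 top-level buckets
        System.out.println(counts.get(0) + " docs in bucket 0");  // 2000 docs in bucket 0
    }
}
```

So each top-level bucket holds 2000 documents, which matches the numDocs=2k figure mentioned in the issue description.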
> Improve JSON nested facets efficiency
> -------------------------------------
>
> Key: SOLR-9142
> URL: https://issues.apache.org/jira/browse/SOLR-9142
> Project: Solr
> Issue Type: Bug
> Reporter: Varun Thacker
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top-level facet field, has a cardinality of 1000.
> For nested facets there are two fields, {{sub_facet_unique_s}} (string) and
> {{sub_facet_unique_td}} (double), each with a cardinality of 2M.
> The nested query on the double field consistently returns in about 1s. The
> nested query on the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> {
> "top_facet_s": {
> "type": "terms",
> "limit": -1,
> "field": "top_facet_s",
> "mincount": 1,
> "excludeTags": "ANY",
> "facet": {
> "sub_facet_unique_s": {
> "type": "terms",
> "limit": 1,
> "field": "sub_facet_unique_s",
> "mincount": 1
> }
> }
> }
> }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> {
> "top_facet_s": {
> "type": "terms",
> "limit": -1,
> "field": "top_facet_s",
> "mincount": 1,
> "excludeTags": "ANY",
> "facet": {
> "sub_facet_unique_s": {
> "type": "terms",
> "limit": 1,
> "field": "sub_facet_unique_td",
> "mincount": 1
> }
> }
> }
> }
> {code}
> I tried to dig deeper to understand why nested faceting on a string field is
> so much slower than on a numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets
> for each of those 1000 buckets. The key difference is in the implementation of
> the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an
> array of 2M entries. So we create a 2M array 1000 times for this one query,
> which, from what I understand, is what makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a
> CountSlotAcc which doesn't allocate a huge array. In this query it calls
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and
> we use the array position as the ordinal and the value as the count. If we
> could improve on this, would it speed things up significantly? For sub-facets
> we know the cardinality can be at most the top-level bucket count.
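To make the cost difference described above concrete, here is a minimal stand-alone sketch that contrasts a dense per-bucket count array (one slot per ordinal in the whole field) with a sparse map that only holds the ordinals seen in the bucket. This is not Solr code: the class and method names are hypothetical, and it assumes, as a simplification, that every document has its own ordinal in the sub-facet field. It only illustrates the hypothesis in the description; it does not establish where the 10s is actually spent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not Solr's implementation.
final class SubFacetCountingSketch {

    // Dense: array position is the ordinal, value is the count.
    // Allocates (and zeroes) fieldCardinality slots for every bucket.
    static int[] denseCounts(int[] bucketOrds, int fieldCardinality) {
        int[] counts = new int[fieldCardinality];
        for (int ord : bucketOrds) {
            counts[ord]++;
        }
        return counts;
    }

    // Sparse: one entry per distinct ordinal in the bucket,
    // so never more entries than the bucket has documents.
    static Map<Integer, Integer> sparseCounts(int[] bucketOrds) {
        Map<Integer, Integer> counts = new HashMap<>(bucketOrds.length * 2);
        for (int ord : bucketOrds) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int fieldCardinality = 2_000_000; // sub-facet field cardinality from the issue
        int topBuckets = 1000;            // top_facet_s cardinality from the issue
        int docsPerBucket = fieldCardinality / topBuckets;

        long denseSlots = 0;
        long sparseSlots = 0;
        for (int b = 0; b < topBuckets; b++) {
            // Assumption: document i has ordinal i and falls into bucket i % topBuckets.
            int[] ords = new int[docsPerBucket];
            for (int j = 0; j < docsPerBucket; j++) {
                ords[j] = j * topBuckets + b;
            }
            denseSlots += denseCounts(ords, fieldCardinality).length;
            sparseSlots += sparseCounts(ords).size();
        }
        System.out.println("dense slots allocated:  " + denseSlots);  // 2000000000
        System.out.println("sparse slots allocated: " + sparseSlots); // 2000000
    }
}
```

Under these assumptions the dense approach allocates a thousand times more counter slots over the whole query. The sparse variant trades that for hashing and boxing overhead per document, so whether it would be faster in practice would need measuring.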