[ 
https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235982#comment-17235982
 ] 

Radu Gheorghe commented on SOLR-15008:
--------------------------------------

Thanks a lot for replying so quickly, Michael!

>  {{OrdinalMap}} instances (as accessed via {{FacetFieldProcessorByArrayDV}}  
>are already cached

That's great, I didn't catch that part of the code! It's a bit confusing, 
though, because we have autoSoftCommit=60000. To get the profile data, I ran 
the same facet in a loop for 2 minutes. Most of the runs took N seconds, and 
only some were instant. If the OrdinalMap were cached I would expect the 
opposite: most facets instant and only some taking N seconds.
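
To make sure I understand what's being cached, here is a rough sketch (my own illustration against the Lucene APIs, not Solr's actual {{FacetFieldProcessorByArrayDV}} code) of how I understand the top-level {{OrdinalMap}} is assembled and keyed on the searcher's reader. If each soft commit opens a new searcher, the key changes and the map has to be rebuilt, which would explain the pattern I'm seeing:

{code:java}
// Rough illustration only (not Solr's actual code): building a top-level
// OrdinalMap over the per-segment SortedSetDocValues. The cost is dominated
// by merging the per-segment term dictionaries, i.e. it scales with the
// number of unique values in the core, not with the number of matching docs.
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.OrdinalMap;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.packed.PackedInts;

public class OrdinalMapSketch {
  static OrdinalMap buildForField(IndexReader reader, String field) throws IOException {
    List<LeafReaderContext> leaves = reader.leaves();
    SortedSetDocValues[] perSegment = new SortedSetDocValues[leaves.size()];
    for (int i = 0; i < leaves.size(); i++) {
      perSegment[i] = DocValues.getSortedSet(leaves.get(i).reader(), field);
    }
    // Keyed on the reader's cache key: a new searcher (e.g. opened by a soft
    // commit) has a new key, so the whole map has to be built again.
    return OrdinalMap.build(
        reader.getReaderCacheHelper().getKey(), perSegment, PackedInts.DEFAULT);
  }
}
{code}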

On the collection I've been testing with, there aren't that many writes, and I 
see no manual soft commits either. I'm attaching a screenshot with this data.

>  Do you have more information about the total numbers involved 
>(high-cardinality field – specifically how high per core?  how many documents 
>overall per core? how many cores?

Yes, let me go from high level to low level. This is a 24-node cluster. Each 
collection (there are about 80 of them) has 8 shards and replicationFactor=3 => 
one core per box per collection. The collection I've been testing with has 10M 
docs and 10GB over 26 segments (most of them at 512MB, the configured max 
segment size). There are 4.35M unique values in the faceted field.
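
To spell out the arithmetic behind those numbers (my own back-of-the-envelope, assuming documents are spread evenly across shards):

{code:java}
// Back-of-the-envelope from the figures above (assumes even distribution).
public class ClusterMath {
  public static void main(String[] args) {
    long coresPerNode = 80L * 8 * 3 / 24;  // 1920 replicas over 24 nodes = 80 cores per node
    long docsPerCore  = 10_000_000L / 8;   // ~1.25M docs per core for this collection
    System.out.println(coresPerNode + " cores/node, ~" + docsPerCore + " docs/core");
    // Each OrdinalMap rebuild for the faceted field merges the term
    // dictionaries of all 26 segments, covering up to the ~4.35M unique
    // values mentioned above.
  }
}
{code}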

Load on the server is low (0.5 over 8 cores), though I see one vCPU spike to 
100% when I run the facet.

>  does the latency manifest even across a single indexSearcher – i.e., no 
>intervening updates?)

I can't really stop updates, but I see high latency when I query a single 
core, too. If I query in a loop (so surely some facets run between soft 
commits), the latency is still there most of the time. I'm not sure if this 
qualifies as reusing the indexSearcher; these are reused between soft commits, 
no?

>  disable refinement for the facet field

I tried it just now and I see about the same latency.
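
For reference, this is roughly how I've been checking both of the above, hitting one core directly (a minimal SolrJ sketch; the core URL and field names are placeholders for my setup):

{code:java}
// Minimal SolrJ sketch: query one core directly with distrib=false so the
// same IndexSearcher is exercised on every run, and disable refinement on
// the terms facet. The core URL and field name are placeholders.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SingleCoreFacetCheck {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient core = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/collection1_shard1_replica_n1").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);
      q.set("distrib", false);  // stay on this single core
      q.add("json.facet",
          "{keywords:{type:terms,field:keywords,limit:20,mincount:20,refine:false}}");
      for (int i = 0; i < 10; i++) {
        QueryResponse rsp = core.query(q);
        System.out.println("QTime=" + rsp.getQTime());
      }
    }
  }
}
{code}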

>  try optimizing each replica to a single segment

I can't do this on the collection mentioned above, because it's too big and too 
recent (it holds last month's data; there are monthly collections going back 2 
years). I tried it on a smaller, older collection where I can reproduce the 
issue (data from 1 year ago: 2GB over 13 segments, which went down to 1.6GB 
after the merge), and the facet_module time dropped from 4000+ ms to 80 ms. 
This is with debug=all.
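
For the record, I'm doing the force-merge roughly like this (a sketch; the collection name is a placeholder, and doing it via the update handler's optimize command instead of SolrJ shouldn't matter):

{code:java}
// Sketch: force-merge an older collection down to one segment via SolrJ.
// The collection name is a placeholder. This rewrites the whole index, so
// it's only practical on collections that no longer receive writes.
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ForceMergeSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        java.util.Collections.singletonList("zk1:2181"),
        java.util.Optional.empty()).build()) {
      // optimize(collection, waitFlush, waitSearcher, maxSegments)
      client.optimize("logs-2019-11", true, true, 1);
    }
  }
}
{code}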

Another thing worth noting: the query on the 10GB collection returned 300 docs, 
while on the old one it returned 700. Latency seems related to the collection 
size (or maybe the number of unique terms?) rather than to the number of 
documents returned. For example, facet_module time showed 20000+ ms on the 10GB 
collection this morning.

>  {{FacetFieldProcessorByHashDV}} was designed to meet a similar use 
>case, but it only works for single-valued fields

Right, this is a multi-valued field. My current plan is to reproduce this issue 
in isolation (i.e. on my laptop) and, if it looks the same, to implement the 
"facet on actual values" patch. Would that work? If yes, any pointers would be 
greatly appreciated :D
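
In case it helps the discussion, this is the rough shape of what I have in mind for "facet on actual values" (my own sketch against the Lucene APIs, not an existing patch): when the filter matches few documents, walk their {{SortedSetDocValues}} and count terms in a plain {{HashMap}}, skipping the {{OrdinalMap}} entirely.

{code:java}
// Sketch only: count the actual term values of the matching documents in a
// HashMap, so the cost scales with the number of hits rather than with the
// total number of unique values in the core.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.util.BytesRef;

public class ValueFacetSketch {
  static Map<String, Integer> countValues(IndexSearcher searcher, Query filter, String field)
      throws IOException {
    Map<String, Integer> counts = new HashMap<>();
    Weight weight =
        searcher.createWeight(searcher.rewrite(filter), ScoreMode.COMPLETE_NO_SCORES, 1f);
    for (LeafReaderContext leaf : searcher.getIndexReader().leaves()) {
      Scorer scorer = weight.scorer(leaf);
      if (scorer == null) continue;  // no matches in this segment
      SortedSetDocValues dv = DocValues.getSortedSet(leaf.reader(), field);
      DocIdSetIterator docs = scorer.iterator();
      for (int doc = docs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docs.nextDoc()) {
        if (dv.advanceExact(doc)) {
          for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
            BytesRef term = dv.lookupOrd(ord);  // resolve the per-segment ord to its term
            counts.merge(term.utf8ToString(), 1, Integer::sum);
          }
        }
      }
    }
    return counts;
  }
}
{code}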

Thanks again!

> Avoid building OrdinalMap for each facet
> ----------------------------------------
>
>                 Key: SOLR-15008
>                 URL: https://issues.apache.org/jira/browse/SOLR-15008
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>    Affects Versions: 8.7
>            Reporter: Radu Gheorghe
>            Priority: Major
>              Labels: performance
>         Attachments: Screenshot 2020-11-19 at 12.01.55.png, writes_commits.png
>
>
> I'm running against the following scenario:
>  * [JSON] faceting on a high cardinality field
>  * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking 
> almost 4s for ~300 documents and about as many unique values (edited a bit):
>  
> {code:java}
>     "QTime":3869,
>     "params":{
>       "json":"{\"query\": \"*:*\",
>       \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", 
> \"unique_id:49866\"]
>       \"facet\": 
> {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
>       "rows":"0"}},
>   
> "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
>   },
>   "facets":{
>     "count":333,
>     "keywords":{
>       "buckets":[{
>           "val":"value1",
>           "count":124},
>   ...
> {code}
> I did some [profiling with our Sematext 
> Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it 
> points me to OrdinalMap building (see attached screenshot). If I read the 
> code right, an OrdinalMap is built with every facet. And it's expensive since 
> there are many unique values in the shard (previously, there were more, 
> smaller shards, which kept latency lower, but that approach doesn't scale for 
> this particular use case).
> If I'm right up to this point, I see a couple of potential improvements, 
> [inspired by 
> Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]:
>  # *Keep the OrdinalMap cached until the next softCommit*, so that only the 
> first query takes the penalty
>  # *Allow faceting on actual values (a Map) rather than ordinals*, for 
> situations like the one above where we have few matching documents. We could 
> potentially auto-detect this scenario (e.g. by configuring a threshold) and 
> use a Map when there are few documents
> I'm curious about what you're thinking:
>  * would a PR/patch be welcome for any of the two ideas above?
>  * do you see better options? am I missing something?
>  


