[jira] [Comment Edited] (SOLR-11733) json.facet refinement fails to bubble up some long tail (overrequested) terms?

Yonik Seeley (JIRA) Thu, 07 Dec 2017 04:39:37 -0800

    [ https://issues.apache.org/jira/browse/SOLR-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281258#comment-16281258 ]


Yonik Seeley edited comment on SOLR-11733 at 12/7/17 12:38 PM:
---------------------------------------------------------------

I mentioned in SOLR-11729 the refinement algorithm being different (and for a 
single-level facet field, simpler).
It can be explained as:
1) find buckets to return as if you weren't doing refinement
2) for those buckets, make sure all shards have contributed to the statistics
i.e. simple refinement doesn't change the buckets you get back.
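
The two steps above can be sketched roughly as follows. This is only an 
illustration in Python, not the Solr code: the shard dicts, simple_refine() 
and shard_top() are hypothetical names, and each "shard" is just an in-memory 
map of term -> count.

```python
from collections import Counter

def shard_top(counts, n):
    """Top n (term, count) pairs, ordered by count desc, then term asc."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def simple_refine(shards, limit, overrequest):
    # Phase 1: every shard reports only its local top (limit + overrequest).
    phase1 = [dict(shard_top(s, limit + overrequest)) for s in shards]

    # Step 1: pick the buckets exactly as if there were no refinement.
    merged = Counter()
    for partial in phase1:
        merged.update(partial)
    chosen = [term for term, _ in shard_top(merged, limit)]

    # Step 2: for those buckets only, ask the shards that did not report
    # them in phase 1 for their contribution.
    refined = {}
    for term in chosen:
        total = merged[term]
        for shard, partial in zip(shards, phase1):
            if term not in partial:
                total += shard.get(term, 0)
        refined[term] = total
    return refined

# Hypothetical data: "x" is in the top (limit + overrequest) of shard 1
# only, and has the largest true total (9 + 8 = 17).
shard1 = {"a": 11, "x": 9, "b": 3}
shard2 = {"b": 10, "c": 9, "x": 8, "a": 2}
result = simple_refine([shard1, shard2], limit=1, overrequest=1)
```

With this made-up data the count for "a" is corrected from 11 to 13, but the 
set of buckets is unchanged, so "x" is never refined or returned.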

I started with the simplest for obvious reasons... to get something out.  From 
a correctness POV, smarter faceting is equivalent to increasing the overrequest 
amount... we still can't make guarantees.
We could easily implement a mode for some field facets that does the "could 
this possibly be in the top N" logic to consider more buckets in the first 
phase... but only if it's not a sub-facet of another partial facet (a facet 
with something like a limit).  If we're sorting by something other than count 
(like stddev for instance) then I guess we'd have to discard smart pruning and 
just try to get all buckets we saw in the first phase.
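
A rough sketch of what the "could this possibly be in the top N" check might 
look like for a single-level field facet sorted by count descending. This is 
an assumption about one way to do it, not an existing Solr mode; 
candidates_for_refinement() and its arguments are hypothetical. It relies on 
the fact that a term a shard did not report cannot have a larger count on 
that shard than the smallest count the shard did report.

```python
def candidates_for_refinement(phase1, limit, requested):
    """phase1: one dict (term -> count) per shard, holding that shard's
    top `requested` terms by count. Returns the terms seen in phase 1
    that could still end up in the overall top `limit`."""
    # Sum of the counts we already know: a lower bound on each true total.
    known = {}
    for partial in phase1:
        for term, count in partial.items():
            known[term] = known.get(term, 0) + count

    # At least `limit` terms are guaranteed to reach this count.
    lower_bounds = sorted(known.values(), reverse=True)
    threshold = lower_bounds[limit - 1] if len(lower_bounds) >= limit else 0

    candidates = set()
    for term, total in known.items():
        upper = total
        for partial in phase1:
            # A shard that returned fewer terms than requested has no
            # unreported terms, so it cannot add anything.
            if term not in partial and len(partial) >= requested:
                upper += min(partial.values())
        if upper >= threshold:
            candidates.add(term)
    return candidates

# Hypothetical phase-1 responses (top 2 per shard).
close_race = [{"a": 11, "x": 9}, {"b": 10, "c": 9}]
clear_winner = [{"a": 50, "b": 5}, {"a": 40, "c": 4}]
```

With close_race every term stays a candidate (so a long tail term like "x" 
would get refined); with clear_winner only "a" does. Terms that no shard 
reported at all are still invisible here, which is why this only behaves 
like a bigger overrequest and not a guarantee.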

If a partial facet is a sub-facet of another partial facet, the logic of what 
one can exclude seems to get harder, and then sub-facets need to add new 
candidate buckets to parent facets (I think? need to think about it more... but 
I guess that's part of my point ;-).  Good ideas perhaps, but definitely more 
difficult to implement.

Other refinement implementations could range all the way to "exact"... 
guarantee that no buckets are missed, and there's more than one way to go about 
that too.







> json.facet refinement fails to bubble up some long tail (overrequested) terms?
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-11733
>                 URL: https://issues.apache.org/jira/browse/SOLR-11733
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>
> Something wonky is happening with {{json.facet}} refinement.
> "Long Tail" terms that may not be in the "top n" on every shard, but are in 
> the "top n + overrequest" for at least 1 shard aren't getting refined and 
> included in the aggregated response in some cases.
> I don't understand the code enough to explain this, but I have some steps to 
> reproduce that I'll post in a comment shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
