[ 
https://issues.apache.org/jira/browse/LUCENE-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340241#comment-17340241
 ] 

Greg Miller commented on LUCENE-9950:
-------------------------------------

I've started digging into this code a bit and find myself a little confused 
about the role of {{SortedSetDocValuesFacetCounts}} and the best approach for 
moving forward with this idea. Taking a step back from thinking about single- 
vs. multi-valued support, I was a little surprised to find that SSDV facet 
counting makes some [pretty strict 
assumptions|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java#L95]
 about the format of the SSDV values. Specifically, it assumes that each value 
represents a strict two-level facet "path" in the form of "dimension/value".
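
To make the two-level assumption concrete, here is a minimal sketch in plain 
Java. It only models the "dimension/value" convention described above; it does 
not reproduce Lucene's actual term encoding. The class name 
{{TwoLevelPathSketch}} and the '/' separator are assumptions made for 
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: models the two-level "dimension/value" convention,
// not Lucene's real on-disk encoding. All names here are hypothetical.
final class TwoLevelPathSketch {
  // Assumed separator, chosen to match the "dimension/value" notation above.
  static final char SEPARATOR = '/';

  /** Groups per-term counts by dimension; rejects terms that are not exactly two levels. */
  static Map<String, Map<String, Integer>> groupByDimension(Map<String, Integer> termCounts) {
    Map<String, Map<String, Integer>> byDim = new TreeMap<>();
    for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
      String term = e.getKey();
      int sep = term.indexOf(SEPARATOR);
      boolean twoLevels =
          sep > 0 && sep < term.length() - 1 && term.indexOf(SEPARATOR, sep + 1) == -1;
      if (!twoLevels) {
        throw new IllegalArgumentException("not a dimension/value path: " + term);
      }
      byDim.computeIfAbsent(term.substring(0, sep), d -> new TreeMap<>())
          .put(term.substring(sep + 1), e.getValue());
    }
    return byDim;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("author/bob", 3);
    counts.put("author/lisa", 1);
    counts.put("year/2021", 4);
    // Several dimensions share one field and are separated only by the prefix.
    System.out.println(groupByDimension(counts));
  }
}
```

A field holding arbitrary strings (no separator, or more than one) fails this 
check, which is why such a counter cannot be pointed at just any string doc 
value field.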

In contrast to this, looking at something like {{LongValueFacetCounts}} or 
{{RangeFacetCounts}}, the approach makes no assumptions about the stored doc 
values. These facet counting implementations can be pointed to any numeric doc 
value field, while {{SortedSetDocValuesFacetCounts}} has to be pointed at a 
field that's indexed in a very specific way. In fact, it looks like most users 
of this functionality will add {{SortedSetDocValuesFacetField}} to their 
document and rely on {{FacetsConfig#build}} to create the doc value field in 
the proper format.

With all this in mind, I wonder if it makes sense to add a new facet counting 
implementation that makes no assumptions about what is stored in the doc value 
field (other than being string content – i.e., {{SortedSetDocValues}} or 
{{SortedDocValues}}), and implement counting functionality similar to 
{{LongValueFacetCounts}}. This would assume "flat" values in each field, where 
the field is effectively equivalent to the "dimension" (e.g., see the 
[approach|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java#L226]
 in {{LongValueFacetCounts}}).
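
As a rough sketch of what "flat" counting could look like, the plain-Java 
example below treats the field itself as the dimension and counts distinct 
values per document. Documents are modelled as lists of strings rather than 
real doc values, and {{FlatStringCountsSketch}} is a hypothetical name, not an 
existing Lucene class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustration only: documents are plain lists of strings, not Lucene doc values.
final class FlatStringCountsSketch {

  /** Counts how many documents contain each value; a doc may hold zero, one, or many values. */
  static Map<String, Integer> count(List<List<String>> docs) {
    Map<String, Integer> counts = new HashMap<>();
    for (List<String> values : docs) {
      // De-duplicate within a document so a repeated value counts once per doc.
      for (String value : new HashSet<>(values)) {
        counts.merge(value, 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Returns up to topN entries ordered by descending count, ties broken by value. */
  static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int topN) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    return new ArrayList<>(entries.subList(0, Math.min(topN, entries.size())));
  }

  public static void main(String[] args) {
    List<List<String>> docs = List.of(
        List.of("red"),            // single-valued document
        List.of("red", "blue"),    // multi-valued document
        List.of());                // document with no value for the field
    System.out.println(top(count(docs), 10));
  }
}
```

Single- and multi-valued documents go through the same loop here, so nothing 
in the counting logic depends on how many values a document carries.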

It seems like this idea of a general string field facet counting implementation 
may have been behind {{SortedSetDocValuesFacetCounts}} originally (LUCENE-4795)? 
Maybe it evolved into something different that makes these assumptions about 
the contents of the stored field? One advantage of its implementation is that 
many different "dimensions" can be stored in the same field for efficient 
counting, but it loses the flexibility to just dynamically count against any 
string doc value field. This also makes me wonder a little about the use cases 
it's designed for, given the existence of taxonomy-based facet counting. It 
seems like the only advantage it might offer over a taxonomy-based approach is 
not requiring the side-car index?

Anyway, back to the main point: I would propose adding a new type of facet 
counting implementation (something like "StringValueFacetCounts" as 
[proposed|https://issues.apache.org/jira/browse/LUCENE-9946?focusedCommentId=17337741&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17337741]
 by [~rcmuir]), that has no requirements on the content stored in the field 
(which could be single- or multi-valued), and simply counts unique values in 
the same way {{LongValueFacetCounts}} does. Thoughts on this approach?

> Support both single- and multi-value string fields in facet counting 
> (non-taxonomy based approaches)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9950
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9950
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> Users wanting to facet count string-based fields using a non-taxonomy-based 
> approach can use {{SortedSetDocValuesFacetCounts}}, which accumulates facet 
> counts based on a {{SortedSetDocValues}} field. This requires the stored doc 
> values to be multi-valued (i.e., {{SORTED_SET}}), and doesn't work on 
> single-valued fields (i.e., {{SORTED}}). In contrast, if a user wants to facet 
> count on a stored numeric field, they can use {{LongValueFacetCounts}}, which 
> supports both single- and multi-valued fields (and in LUCENE-9948, we now 
> auto-detect instead of asking the user to specify).
> Let's update {{SortedSetDocValuesFacetCounts}} to also support, and 
> automatically detect, single- and multi-valued fields. Note that this is a 
> spin-off issue from LUCENE-9946, where [~rcmuir] points out that this can 
> essentially be a one-line change, but we may want to do some class renaming 
> at the same time. Also note that we should do this in 
> {{ConcurrentSortedSetDocValuesFacetCounts}} while we're at it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

