[ https://issues.apache.org/jira/browse/LUCENE-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340241#comment-17340241 ]
Greg Miller commented on LUCENE-9950:
-------------------------------------

I've started digging into this code a bit and find myself a little confused about the role of {{SortedSetDocValuesFacetCounts}} and the best approach for moving forward with this idea.

Taking a step back from thinking about single- vs. multi-valued support, I was a little surprised to find that SSDV facet counting makes some [pretty strict assumptions|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java#L95] about the format of the SSDV values. Specifically, it assumes that each value represents a strict two-level facet "path" in the form of "dimension/value".

In contrast, something like {{LongValueFacetCounts}} or {{RangeFacetCounts}} makes no assumptions about the stored doc values. These facet counting implementations can be pointed at any numeric doc value field, while {{SortedSetDocValuesFacetCounts}} has to be pointed at a field that's indexed in a very specific way. In fact, it looks like most users of this functionality will add {{SortedSetDocValuesFacetField}} to their document and rely on {{FacetsConfig#build}} to create the doc value field in the proper format.

With all this in mind, I wonder if it makes sense to add a new facet counting implementation that makes no assumptions about what is stored in the doc value field (other than it being string content, i.e., {{SortedSetDocValues}} or {{SortedDocValues}}), and that implements counting functionality similar to {{LongValueFacetCounts}}. This would assume "flat" values in each field, where the field is effectively equivalent to the "dimension" (e.g., see the [approach|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java#L226] in {{LongValueFacetCounts}}).
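
To make the "flat" counting idea concrete, here is a minimal sketch in plain Java. It is only an illustration: it does not use Lucene's doc values APIs, and the class and method names ({{FlatStringFacetSketch}}, {{countValues}}, {{topN}}) are hypothetical. Each document is modeled as a list of string values, so a single-valued field is simply a one-element list, and a value is counted at most once per document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of "flat" string value counting; not Lucene code. */
public class FlatStringFacetSketch {

    /**
     * Counts, for each distinct value, the number of documents containing it.
     * Values are treated as opaque strings: no "dimension/value" parsing.
     */
    static Map<String, Integer> countValues(List<List<String>> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> docValues : docs) {
            // De-duplicate within a document so a repeated value counts once.
            for (String value : new TreeSet<>(docValues)) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns up to n entries, highest count first, ties broken by value. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("red", "blue"),    // multi-valued document
            List.of("red"),            // single-valued document
            List.of("green", "green")  // duplicate value within one document
        );
        Map<String, Integer> counts = countValues(docs);
        System.out.println(counts);          // prints {blue=1, green=1, red=2}
        System.out.println(topN(counts, 2)); // prints [red=2, blue=1]
    }
}
```

In this sketch the field itself plays the role of the "dimension", so the same routine handles single- and multi-valued input without any knowledge of how the values were indexed.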

It seems like this idea of a general string field facet counting implementation may have been behind {{SortedSetDocValuesFacetCounts}} originally (LUCENE-4795)? Maybe it evolved into something different that makes these assumptions about the contents of the stored field? One advantage of its implementation is that many different "dimensions" can be stored in the same field for efficient counting, but it loses the flexibility to dynamically count against any string doc value field. This also makes me wonder a little about the use cases it's designed for, given the existence of taxonomy-based facet counting. It seems like the only advantage it might offer over a taxonomy-based approach is not requiring the side-car index?

Anyway, back to the main point: I would propose adding a new type of facet counting implementation (something like "StringValueFacetCounts" as [proposed|https://issues.apache.org/jira/browse/LUCENE-9946?focusedCommentId=17337741&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17337741] by [~rcmuir]) that has no requirements on the content stored in the field (which could be single- or multi-valued) and simply counts unique values in the same way {{LongValueFacetCounts}} does.

Thoughts on this approach?

> Support both single- and multi-value string fields in facet counting
> (non-taxonomy based approaches)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9950
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9950
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> Users wanting to facet count string-based fields using a non-taxonomy-based
> approach can use {{SortedSetDocValuesFacetCounts}}, which accumulates facet
> counts based on a {{SortedSetDocValues}} field. This requires the stored doc
> values to be multi-valued (i.e., {{SORTED_SET}}), and doesn't work on
> single-valued fields (i.e., {{SORTED}}). In contrast, if a user wants to facet
> count on a stored numeric field, they can use {{LongValueFacetCounts}}, which
> supports both single- and multi-valued fields (and in LUCENE-9948, we now
> auto-detect instead of asking the user to specify).
> Let's update {{SortedSetDocValuesFacetCounts}} to also support, and
> automatically detect, single- and multi-value fields. Note that this is a
> spin-off issue from LUCENE-9946, where [~rcmuir] points out that this can
> essentially be a one-line change, but we may want to do some class renaming
> at the same time. Also note that we should do this in
> {{ConcurrentSortedSetDocValuesFacetCounts}} while we're at it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org