[ https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444124#comment-17444124 ]
Greg Miller commented on LUCENE-10062:
--------------------------------------

I'm posting a new PR now for adding this format change to the 9.0 release. The intention is to maintain backwards compatibility with 8.x indices, and then drop support for the older binary doc values format in 10 (which allows us to avoid some of the back-compat complexity on the main branch). The PR I'm posting takes a fairly aggressive approach to deprecation, and I'm curious what folks will think of this. I'll outline the deprecation approach here, starting with the less-controversial followed by the potentially more-controversial.

*Less Controversial*

Starting with Lucene 9.0, taxonomy ordinals will be stored as a {{SortedNumericDocValues}} field. For backwards compatibility, if Lucene 9.x code is writing to an index created with 8.x, it will revert back to using a {{BinaryDocValues}} format.

In Lucene 8.x, we allow users to plug in their own custom binary format if they don't want the default. This will continue to work in 9.x, but only if writing to an 8.x index. Users will not be able to plug in any sort of custom format for indexes created with 9.0 onward (they'll get a {{SortedNumericDocValues}} field as-is).

When merging segments, we will honor the present format for backwards compatibility. So if the segments being merged were written with 9.x, we'll merge the {{SortedNumericDocValues}} fields. If we're merging 8.x segments, we'll maintain the older binary format (including any customization plugged in by the user). Again, no custom format support will be provided for 9.0 onwards.

When reading the ordinals, we'll be backwards compatible with 8.x indexes (using the binary format).

*Potentially Controversial*

Users currently have the ability to provide ordinals for a given document through the concept of an {{OrdinalsReader}} when using {{TaxonomyFacetCounts}}, {{TaxonomyFacetSumValueSource}} and {{TaxonomyFacetLabels}}.
This seems like it's available mainly to support users that have created a custom binary format for their taxonomy ordinals. But, in theory, it could be useful more generally if users have some need to provide ordinals in some other, custom way.

I propose deprecating this concept entirely. While it's not terribly hard to keep it around, I struggle to think of a real use-case for users needing to provide ordinals in a custom way if we no longer support the ability to plug in a custom binary format. Note that the other facet implementations (including things like {{FastTaxonomyFacetCounts}}) assume the default encoding, so they'll seamlessly switch from the binary format to the numeric format under-the-hood in a backwards-compatible fashion. If users really have some custom need, there's nothing preventing them from implementing their own {{Facets}} sub-class, etc.

If anyone knows of real-world use-cases for maintaining the support for {{OrdinalsReader}}, I'm happy to keep it in. I have a version of the change that does so already, so it's not really any extra work, it just seems a good opportunity to remove some code complexity.

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for
> faceting
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-10062
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10062
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Greg Miller
>            Assignee: Greg Miller
>            Priority: Minor
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We currently encode taxonomy ordinals using varint style packing in a binary
> doc values field. I suspect there have been a number of improvements to
> SortedNumericDocValues since taxonomy faceting was first introduced, and I
> plan to explore replacing the custom binary format we have today with a
> SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.
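
For readers unfamiliar with the "varint style packing" mentioned in the issue description, here is a minimal, self-contained sketch of the general idea: sort and de-duplicate a document's ordinals, store the gaps between them, and write each gap in 7-bit groups. This is only an illustration under stated assumptions. The class and method names ({{OrdinalPackingSketch}}, {{encode}}, {{decode}}) are hypothetical, the byte layout is a generic low-bits-first varint and is not claimed to match Lucene's on-disk format, and ordinals are assumed to be non-negative.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration only; not Lucene code and not Lucene's exact byte layout.
final class OrdinalPackingSketch {

    /** Sorts, de-duplicates, delta-encodes and varint-packs non-negative ordinals. */
    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = -1; // sentinel: no ordinal written yet
        for (int ord : sorted) {
            if (ord == last) {
                continue; // skip duplicates
            }
            // The first value is stored as-is, later ones as the gap to the previous ordinal.
            int delta = (last == -1) ? ord : ord - last;
            // Emit 7 bits per byte, low bits first; a set high bit means "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            last = ord;
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads varint gaps and accumulates them back into ordinals. */
    static int[] decode(byte[] packed) {
        int[] result = new int[packed.length]; // each ordinal takes at least one byte
        int count = 0;
        int ord = 0;
        int i = 0;
        while (i < packed.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = packed[i++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            ord += delta;
            result[count++] = ord;
        }
        return Arrays.copyOf(result, count);
    }

    public static void main(String[] args) {
        int[] ordinals = {300, 3, 7, 7, 70000};
        byte[] packed = encode(ordinals);
        System.out.println("packed bytes: " + packed.length);
        System.out.println("decoded: " + Arrays.toString(decode(packed)));
    }
}
```

With a numeric doc values field, this kind of hand-rolled packing and unpacking would no longer live in the facet module; the values would be handed to the doc values format as plain numbers.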