[ 
https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444124#comment-17444124
 ] 

Greg Miller commented on LUCENE-10062:
--------------------------------------

I'm posting a new PR now for adding this format change to the 9.0 release. The 
intention is to maintain backwards compatibility with 8.x indices, and then 
drop support for the older binary doc values format in 10 (which allows us to 
avoid some of the back-compat complexity on the main branch). The PR I'm 
posting takes a fairly aggressive approach to deprecation, and I'm curious what 
folks will think of it. I'll outline the deprecation approach here, starting 
with the less controversial parts, followed by the potentially more 
controversial one.

*Less Controversial*

Starting with Lucene 9.0, taxonomy ordinals will be stored as a 
{{SortedNumericDocValues}} field. For backwards compatibility, if Lucene 9.x 
code is writing to an index created with 8.x, it will fall back to the 
{{BinaryDocValues}} format. In Lucene 8.x, we allow users to plug in their own 
custom binary format if they don't want the default. This will continue to work 
in 9.x, but only if writing to an 8.x index. Users will not be able to plug in 
any sort of custom format for indexes created with 9.0 onward (they'll get a 
{{SortedNumericDocValues}} field as-is).
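
To make the write-side rule above concrete, here is a minimal sketch in plain Java. It is not taken from the PR and uses no Lucene APIs: the class, enum, and method names are hypothetical, and it assumes the decision is keyed on the major version that created the index and that a custom encoder is simply ignored for 9.0+ indexes.

```java
/** Hypothetical sketch of the write-side format choice; not Lucene code. */
final class OrdinalWriteFormatSketch {

    /** The three outcomes described in the comment above. */
    enum OrdinalFormat { BINARY, CUSTOM_BINARY, SORTED_NUMERIC }

    /**
     * @param indexCreatedMajor major version of the release that created the index (assumed input)
     * @param hasCustomBinaryEncoder whether the user plugged in a custom binary format (assumed input)
     */
    static OrdinalFormat chooseWriteFormat(int indexCreatedMajor, boolean hasCustomBinaryEncoder) {
        if (indexCreatedMajor >= 9) {
            // Indexes created with 9.0 onward always get the numeric field;
            // this sketch assumes any custom encoder is ignored here.
            return OrdinalFormat.SORTED_NUMERIC;
        }
        // Older (8.x) indexes keep the binary format, custom or default.
        return hasCustomBinaryEncoder ? OrdinalFormat.CUSTOM_BINARY : OrdinalFormat.BINARY;
    }

    public static void main(String[] args) {
        System.out.println(chooseWriteFormat(8, false));
        System.out.println(chooseWriteFormat(8, true));
        System.out.println(chooseWriteFormat(9, true));
    }
}
```

The same check would apply when merging: the format already present in the segments decides which branch is taken.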

When merging segments, we will honor the present format for backwards 
compatibility. So if the segments being merged were written with 9.x, we'll 
merge the {{SortedNumericDocValues}} fields. If we're merging 8.x segments, 
we'll maintain the older binary format (including any customization plugged in 
by the user). Again, no custom format support will be provided for 9.0 onwards.

When reading the ordinals, we'll be backwards compatible with 8.x indexes 
(using the binary format).
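
As an illustration of the read-side fallback, the sketch below decodes ordinals from either representation. It is plain Java with hypothetical names, not the PR's code. The legacy layout is assumed to be delta-coded varints (7 payload bits per byte, high bit as continuation flag), a simplification of the "varint style packing" mentioned in the issue description, and the numeric values are assumed to arrive as a sorted {{long[]}}.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical sketch of reading ordinals from either format; not Lucene code. */
final class OrdinalReadSketch {

    /** Encodes sorted ordinals as deltas, each written as a varint (assumed legacy layout). */
    static byte[] encodeLegacy(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int delta = ord - prev;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // low 7 bits, continuation bit set
                delta >>>= 7;
            }
            out.write(delta); // final byte, continuation bit clear
            prev = ord;
        }
        return out.toByteArray();
    }

    /** Reverses encodeLegacy: reads varint deltas and accumulates them into ordinals. */
    static int[] decodeLegacy(byte[] bytes) {
        int[] ords = new int[bytes.length]; // each ordinal takes at least one byte
        int count = 0, prev = 0, i = 0;
        while (i < bytes.length) {
            int delta = 0, shift = 0;
            byte b;
            do {
                b = bytes[i++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ords[count++] = prev;
        }
        return Arrays.copyOf(ords, count);
    }

    /** Picks the representation by the (assumed) major version that created the index. */
    static int[] readOrdinals(int indexCreatedMajor, byte[] binaryValue, long[] numericValues) {
        if (indexCreatedMajor >= 9) {
            int[] ords = new int[numericValues.length];
            for (int i = 0; i < ords.length; i++) {
                ords[i] = Math.toIntExact(numericValues[i]);
            }
            return ords;
        }
        return decodeLegacy(binaryValue);
    }

    public static void main(String[] args) {
        int[] ords = {3, 300, 301};
        byte[] legacy = encodeLegacy(ords);
        System.out.println(Arrays.toString(readOrdinals(8, legacy, null)));
        System.out.println(Arrays.toString(readOrdinals(9, null, new long[] {3, 300, 301})));
    }
}
```

Both calls in {{main}} yield the same ordinals, which is the property that lets facet implementations switch formats without callers noticing.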

*Potentially Controversial*

Users currently have the ability to provide ordinals for a given document 
through the concept of an {{OrdinalsReader}} when using 
{{TaxonomyFacetCounts}}, {{TaxonomyFacetSumValueSource}} and 
{{TaxonomyFacetLabels}}. This seems like it's available mainly to support 
users that have created a custom binary format for their taxonomy ordinals. 
But, in theory, it could be useful more generally if users have some need to 
provide ordinals in some other, custom way. I propose deprecating this concept 
entirely. While it's not terribly hard to keep it around, I struggle to think 
of a real use-case for users needing to provide ordinals in a custom way if we 
no longer support the ability to plug in a custom binary format. Note that the 
other facet implementations (including things like 
{{FastTaxonomyFacetCounts}}) assume the default encoding, so they'll 
seamlessly switch from the binary format to the numeric format under the hood 
in a backwards-compatible fashion. If users really have some custom need, 
there's nothing preventing them from implementing their own {{Facets}} 
sub-class, etc.

If anyone knows of real-world use-cases for maintaining the support for 
{{OrdinalsReader}}, I'm happy to keep it in. I have a version of the change 
that does so already, so it's not really any extra work; it just seems like a 
good opportunity to remove some code complexity.

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for 
> faceting
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-10062
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10062
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Greg Miller
>            Assignee: Greg Miller
>            Priority: Minor
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We currently encode taxonomy ordinals using varint style packing in a binary 
> doc values field. I suspect there have been a number of improvements to 
> SortedNumericDocValues since taxonomy faceting was first introduced, and I 
> plan to explore replacing the custom binary format we have today with a 
> SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
