[ https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398655#comment-17398655 ]

Michael McCandless commented on LUCENE-10048:
---------------------------------------------

Some brief historical context: LUCENE-8947 (and [this 
PR|https://github.com/apache/lucene-solr/pull/2080]) is where we discussed this 
(and decided against it) last time.  Please read the discussion on the PR – 
that is when [~rcmuir] convinced me removing this limit is indeed dangerous.
{quote}You are referring to 
[Payloads|https://cwiki.apache.org/confluence/display/LUCENE/Payloads] right? 
It is a viable option but less space efficient (no delta compression) compared 
to storing these values directly as term-frequencies.
{quote}
Note that term frequencies are also not delta-coded (the way Lucene docids 
are), because they are not in any sorted order.  Rather, they are written in 
blocks of 128 ints, encoded with PFOR.  Though perhaps we could subtract the 
minimum value across each block (storing it separately, once per block) and 
encode the differences?  Not sure.  We do efficiently encode the case where 
all 128 values are the same.
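
To make the subtract-the-minimum idea concrete, here is a tiny hypothetical 
sketch (this is frame-of-reference encoding in general, *not* Lucene's actual 
PFOR codec):
{code:java}
// Hypothetical sketch: subtract each block's minimum (stored once per block)
// so the remaining values need fewer bits when bit-packed. A block where all
// 128 values are equal would pack down to zero bits per value.
class BlockMinSketch {
  static int[] subtractBlockMin(int[] block) { // block.length == 128
    int min = Integer.MAX_VALUE;
    for (int v : block) {
      min = Math.min(min, v);
    }
    int[] shifted = new int[block.length];
    for (int i = 0; i < block.length; i++) {
      shifted[i] = block[i] - min; // 'min' is stored separately, alongside the packed block
    }
    return shifted;
  }
}
{code}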

Payloads would work here, but I suspect that'd be slower to access at search 
time.
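
For anyone comparing the two approaches, here is a minimal sketch of storing a 
per-token score as a payload; the filter name and the fixed 4-byte big-endian 
encoding are just illustrative assumptions:
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Illustrative filter that attaches one int score per token as a 4-byte payload.
// Payloads are stored per position with no delta compression, which is the
// space-efficiency concern quoted above.
final class ScorePayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final int score; // hypothetical: the same score for every token

  ScorePayloadFilter(TokenStream in, int score) {
    super(in);
    this.score = score;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    payloadAtt.setPayload(new BytesRef(new byte[] {
        (byte) (score >>> 24), (byte) (score >>> 16),
        (byte) (score >>> 8), (byte) score}));
    return true;
  }
}
{code}
At search time the payload has to be pulled per position via 
PostingsEnum.getPayload(), which is likely where the extra cost would come from.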
{quote}[~goankur] I wonder if encoding the scores more efficiently would be an 
option, e.g. using a bfloat16?
{quote}
+1, that seems like a great idea!  Such an encoding should easily stay under 
Lucene's limits (as long as you have fewer than 64K such tokens, even if every 
token had the maximum uint16 value of a bfloat16), and it could represent the 
full range of numbers (with some loss of precision, of course).  Plus [modern 
CPUs seem to optimize for this 
representation|https://en.wikipedia.org/wiki/Bfloat16_floating-point_format] 
(not sure whether Hotspot can tap into that, though).  I like this path.  
Though I suspect compression might be worse in cases where the scores are all 
smallish positive integers.
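
A minimal sketch of the bfloat16 round trip (using simple truncation; a real 
implementation might round to nearest), where the resulting uint16 is what 
would be stored as the custom term frequency:
{code:java}
// bfloat16 is just the top 16 bits of an IEEE float32:
// 1 sign bit, 8 exponent bits, 7 mantissa bits.
class BFloat16 {
  static int floatToBFloat16(float score) {
    return Float.floatToIntBits(score) >>> 16; // truncation drops 16 mantissa bits
  }

  static float bFloat16ToFloat(int bits) {
    return Float.intBitsToFloat((bits & 0xFFFF) << 16);
  }
}
{code}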
{quote}It isn't viable to have an option that says "allow me to corrupt my 
index".
{quote}
+1, I don't think we can safely relax this limit, even under a boolean 
option/flag.  This really is an abusive use-case for Lucene.

I can totally understand why we (Amazon product search) are using this feature, 
because we are trying to store a (sometimes massive) int score/ranking signal 
per query/atom X document.  But it is quite insane that we cannot keep the 
values within int32 when summed across one doc X field.

> Bypass total frequency check if field uses custom term frequency
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10048
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10048
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Tony Xu
>            Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a 
> check on the per-field total token count (i.e. the field length) to ensure 
> we don't index too many tokens. This is done by accumulating each token's 
> *TermFrequencyAttribute*.
>  
> Given that Lucene currently allows a custom term frequency to be attached 
> to each token, and that usage of the frequency can be pretty wild, the 
> check can fail with only a few tokens that have large frequencies, e.g.:
> *"foo|<very large number> bar|<very large number>"*
> In that case Lucene will skip indexing the whole document. (A sketch of 
> this "token|freq" syntax follows the quoted description below.)
>  
> What should be the way to inform the indexing chain not to check the field 
> length?
> A related observation: when custom term frequency is in use, the user is 
> not likely to use the similarity for this field. Maybe we can offer a way 
> to specify that, too?
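
For reference, the *"foo|<very large number>"* syntax in the description above 
is what DelimitedTermFrequencyTokenFilter parses. A minimal sketch of wiring it 
up, assuming whitespace tokenization and the default '|' delimiter:
{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.DelimitedTermFrequencyTokenFilter;

// Parses input like "foo|100000 bar|200000" into tokens with custom term
// frequencies. The field must omit norms and be indexed with
// IndexOptions.DOCS_AND_FREQS for custom frequencies to be accepted.
public class CustomTermFreqAnalyzer {
  public static Analyzer create() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer source = new WhitespaceTokenizer();
        TokenStream sink = new DelimitedTermFrequencyTokenFilter(source); // default delimiter '|'
        return new TokenStreamComponents(source, sink);
      }
    };
  }
}
{code}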


