[
https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398355#comment-17398355
]
Ankur commented on LUCENE-10048:
--------------------------------
Thanks for your response [~rcmuir].
Let me try to explain the use case at a high level:
# An offline (map-reduce style) batch process consumes a set of indexable
documents.
# The process also consumes terms and metadata information from external data
sources.
# For each indexable document, the batch process computes a set of term-doc
scores and adds this set to a document field (to be indexed later).
# A document will only have a small number of such terms in a field, *less
than 10K*.
# There could be *many such fields* in a single document populated by
different offline processes, all of which scale these values arbitrarily (for
historical reasons) but still make sure a single value fits in 4 bytes.
# The document also has the usual textual fields (title, description, etc.) for
which Lucene computes term/field statistics and produces BM25 scores.
# All of these scores are used by a ranking method.
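To make step 3 concrete, here is a minimal sketch of how a batch process might serialize term-doc scores into a single field value. It assumes the `term|frequency` token format shown in the issue description; the class `TermScoreField` and the method `encode` are hypothetical names for illustration and are not part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class TermScoreField {

    // Hypothetical helper: encodes term -> score pairs as space-separated
    // "term|score" tokens. Scores must be positive because they are meant
    // to be stored as term frequencies.
    static String encode(Map<String, Integer> termScores) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, Integer> entry : termScores.entrySet()) {
            if (entry.getValue() <= 0) {
                throw new IllegalArgumentException(
                    "score must be positive for term: " + entry.getKey());
            }
            joiner.add(entry.getKey() + "|" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so the output is deterministic.
        Map<String, Integer> scores = new LinkedHashMap<>();
        scores.put("foo", 2_000_000_000);
        scores.put("bar", 1_500_000_000);
        System.out.println(encode(scores));
    }
}
```

Each value fits in 4 bytes on its own, but the two values together already exceed what a 4-byte signed sum can hold.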
You are referring to
[Payloads|https://cwiki.apache.org/confluence/display/LUCENE/Payloads], right?
It is a viable option, but it is less space-efficient (no delta compression)
than storing these values directly as term frequencies.
So only for fields that are populated by an external process, I am hoping we
can come up with a mechanism to ignore the overflow checks on term/field
statistics.
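As a rough illustration of what I am asking for, the sketch below models the field-length check as an overflow-checked integer sum and adds a per-field switch to bypass it. This is a simplified model under stated assumptions, not Lucene's actual indexing code: the class `FieldLengthCheck`, the `skipOverflowCheck` flag, and the choice to saturate at `Integer.MAX_VALUE` are all hypothetical.

```java
public class FieldLengthCheck {

    // Sums the term frequencies of one field into an int "field length".
    // With skipOverflowCheck == false, overflow rejects the field.
    // With skipOverflowCheck == true (hypothetical bypass), the length
    // saturates at Integer.MAX_VALUE instead of failing.
    static int fieldLength(int[] termFreqs, boolean skipOverflowCheck) {
        int length = 0;
        for (int freq : termFreqs) {
            if (skipOverflowCheck) {
                // Widen to long so the addition itself cannot overflow.
                length = (int) Math.min((long) length + freq, Integer.MAX_VALUE);
            } else {
                try {
                    length = Math.addExact(length, freq);
                } catch (ArithmeticException e) {
                    throw new IllegalArgumentException("too many tokens for field", e);
                }
            }
        }
        return length;
    }

    public static void main(String[] args) {
        int[] freqs = {2_000_000_000, 1_500_000_000};
        System.out.println(fieldLength(freqs, true));
        try {
            fieldLength(freqs, false);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Saturating is only one possible design; the field length could also be left untracked for such fields, since the similarity is unlikely to use it.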
> Bypass total frequency check if field uses custom term frequency
> ----------------------------------------------------------------
>
> Key: LUCENE-10048
> URL: https://issues.apache.org/jira/browse/LUCENE-10048
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Tony Xu
> Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a
> check on the per-field total token count (i.e. field length) to ensure we
> don't index too many tokens. This is done by accumulating each token's
> *TermFrequencyAttribute*.
>
> Given that Lucene currently allows a custom term frequency to be attached to
> each token, and that the usage of the frequency can vary widely, it is
> possible for the check to fail with only a few tokens that have large
> frequencies, as in the example below. In that case Lucene currently skips
> indexing the whole document.
> *"foo|<very large number> bar|<very large number>"*
>
> What should be the way to inform the indexing chain not to check the field
> length? A related observation: when a custom term frequency is in use, the
> user is not likely to use the similarity for this field. Maybe we can offer
> a way to specify that, too?