Michael McCandless created LUCENE-7854:
------------------------------------------
Summary: Indexing custom term frequencies
Key: LUCENE-7854
URL: https://issues.apache.org/jira/browse/LUCENE-7854
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: master (7.0)
When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will store
just the docID and term frequency (how many times that term occurred in that
document) for all documents that have a given term.
We compute that term frequency by counting how many times a given token
appeared in the field during analysis.
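As a rough illustration of that counting step, here is a toy model in plain Java. It is not Lucene's indexing chain: the class {{TermFreqCounter}} and its method {{countFreqs}} are hypothetical names, and the token list stands in for the output of analysis on one field of one document.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: this is NOT Lucene's indexing chain, and the names are hypothetical. */
final class TermFreqCounter {

    /** Counts how often each token occurs in one document's field. */
    static Map<String, Integer> countFreqs(List<String> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : tokens) {
            // Every occurrence produced by analysis adds 1 to that term's frequency.
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // prints {to=2, be=2, or=1, not=1}
        System.out.println(countFreqs(List.of("to", "be", "or", "not", "to", "be")));
    }
}
```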
But it can be useful, in expert use cases, to customize what Lucene stores as
the term frequency, e.g. to hold custom scoring signals that are a function of
term and document (this is my use case). Users have also asked for this
before, e.g. see
https://stackoverflow.com/questions/26605090/lucene-overwrite-term-frequency-at-index-time.
One way to do this today is to stuff your custom data into a {{byte[]}}
payload. But that's quite inefficient: it forces you to index positions and to
pay the overhead of retrieving payloads at search time.
Another approach is "token stuffing": just enumerate the same token N times
where N is the custom number you want to store, but that's also inefficient
when N gets high.
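To make the cost of token stuffing concrete, here is a minimal sketch in plain Java. {{TokenStuffing}} and {{stuff}} are hypothetical names, and the returned list stands in for the tokens a custom analyzer would have to emit. The work and the number of tokens grow linearly with N.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of "token stuffing"; hypothetical names, not a Lucene API. */
final class TokenStuffing {

    /** Emits the same term n times, so that counting occurrences yields a frequency of n. */
    static List<String> stuff(String term, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> tokens = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // One emitted token per unit of the custom value: O(n) tokens for a single number.
            tokens.add(term);
        }
        return tokens;
    }
}
```

For example, {{TokenStuffing.stuff("signal", 1000)}} has to produce 1000 identical tokens just to record the single value 1000.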
I think we can make this simple to do in Lucene. I have a working version,
using my own custom indexing chain, but the required changes are quite simple,
so perhaps we can add it to Lucene's default indexing chain?
I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked the
indexing chain to use that attribute's value as the term frequency if it's
present and the field's index options are {{DOCS_AND_FREQS}}.
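The rule described here can be sketched as a toy model in plain Java. This is not the actual patch: {{CustomFreqModel}}, {{Token}} and {{termFreqs}} are hypothetical names, the local {{IndexOptions}} enum is a reduced stand-in for Lucene's, and {{customFreq}} stands in for the value carried by the proposed attribute. Two behaviors are assumptions of this sketch rather than anything stated above: custom values for a repeated term are summed, and fields with other index options fall back to plain counting.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

/** Toy model of the proposed rule; hypothetical names, not Lucene's indexing chain. */
final class CustomFreqModel {

    /** Reduced stand-in for Lucene's IndexOptions enum. */
    enum IndexOptions { DOCS, DOCS_AND_FREQS, DOCS_AND_FREQS_AND_POSITIONS }

    /** One analyzed token; customFreq models the optional attribute value. */
    static final class Token {
        final String term;
        final OptionalInt customFreq;

        Token(String term, OptionalInt customFreq) {
            this.term = term;
            this.customFreq = customFreq;
        }
    }

    /** Computes per-term frequencies for one document's field. */
    static Map<String, Integer> termFreqs(List<Token> tokens, IndexOptions options) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (Token t : tokens) {
            // Use the custom value only if it is present AND the field is DOCS_AND_FREQS.
            boolean useCustom = t.customFreq.isPresent()
                    && options == IndexOptions.DOCS_AND_FREQS;
            // Assumption: otherwise each occurrence counts as 1, as in normal counting.
            int increment = useCustom ? t.customFreq.getAsInt() : 1;
            // Assumption: repeated terms accumulate by summing.
            freqs.merge(t.term, increment, Integer::sum);
        }
        return freqs;
    }
}
```

With a token {{("lucene", 42)}} and two plain {{"search"}} tokens, this model yields a frequency of 42 for {{lucene}} and 2 for {{search}} under {{DOCS_AND_FREQS}}; under any other option in the model, {{lucene}} is simply counted once.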