[
https://issues.apache.org/jira/browse/LUCENE-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-7854:
---------------------------------------
Attachment: LUCENE-7854.patch
Whoops, another iteration ;) Thanks [~thetaphi].
> Indexing custom term frequencies
> --------------------------------
>
> Key: LUCENE-7854
> URL: https://issues.apache.org/jira/browse/LUCENE-7854
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0)
>
> Attachments: LUCENE-7854.patch, LUCENE-7854.patch, LUCENE-7854.patch,
> LUCENE-7854.patch
>
>
> When you index a field with {{IndexOptions.DOCS_AND_FREQS}}, Lucene will
> store just the docID and term frequency (how many times that term occurred in
> that document) for all documents that have a given term.
> We compute that term frequency by counting how many times a given token
> appeared in the field during analysis.
> But it can be useful, in expert use cases, to customize what Lucene stores as
> the term frequency, e.g. to hold custom scoring signals that are a function
> of term and document (this is my use case). Users have also asked for this
> before, e.g. see
> https://stackoverflow.com/questions/26605090/lucene-overwrite-term-frequency-at-index-time.
> One way to do this today is to stuff your custom data into a {{byte[]}}
> payload. But that's quite inefficient, forcing you to index positions, and
> pay the overhead of retrieving payloads at search time.
> Another approach is "token stuffing": just enumerate the same token N times
> where N is the custom number you want to store, but that's also inefficient
> when N gets high.
> I think we can make this simple to do in Lucene. I have a working version,
> using my own custom indexing chain, but the required changes are quite simple
> so I think we can add it to Lucene's default indexing chain?
> I created a new token attribute, {{TermDocFrequencyAttribute}}, and tweaked
> the indexing chain to use that attribute's value as the term frequency if
> it's present, and if the index options are {{DOCS_AND_FREQS}} for that field.
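
To make the comparison in the description concrete, here is a minimal plain-Java sketch of the idea. It is not the attached patch and does not use Lucene's classes: {{FreqToken}} and {{TermFreqSketch}} are hypothetical names, and summing the per-token values is an assumption of this sketch. It only shows that a per-token custom frequency yields the same stored value as "token stuffing" while emitting a single token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a token carrying a custom term frequency.
// This is NOT Lucene's API; it only models the idea in plain Java.
final class FreqToken {
    final String term;
    final int freq; // 1 behaves like an ordinary token occurrence

    FreqToken(String term, int freq) {
        if (freq < 1) {
            throw new IllegalArgumentException("freq must be >= 1, got " + freq);
        }
        this.term = term;
        this.freq = freq;
    }
}

final class TermFreqSketch {
    // Default behaviour: term frequency = how many times the token appeared.
    static Map<String, Integer> countOccurrences(List<String> terms) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String term : terms) {
            freqs.merge(term, 1, Integer::sum);
        }
        return freqs;
    }

    // "Token stuffing": emit the same token n times, which costs O(n) tokens.
    static List<String> stuff(String term, int n) {
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(term);
        }
        return out;
    }

    // Attribute-style: each token contributes its own frequency value
    // (assumption of this sketch: values for a repeated term are summed).
    static Map<String, Integer> withCustomFreq(List<FreqToken> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (FreqToken token : tokens) {
            freqs.merge(token.term, token.freq, Integer::sum);
        }
        return freqs;
    }
}
```

In this model, {{TermFreqSketch.countOccurrences(TermFreqSketch.stuff("foo", 42))}} and {{TermFreqSketch.withCustomFreq(...)}} applied to a single {{new FreqToken("foo", 42)}} both map "foo" to 42, but the second processes one token instead of 42.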