[
https://issues.apache.org/jira/browse/LUCENE-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-7485:
---------------------------------
Attachment: LUCENE-7485.patch
Thanks for having a look Ryan! Here is an updated patch.
> Better storage for `docsWithField` in Lucene70NormsFormat
> ---------------------------------------------------------
>
> Key: LUCENE-7485
> URL: https://issues.apache.org/jira/browse/LUCENE-7485
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7485.patch, LUCENE-7485.patch
>
>
> Currently {{Lucene70NormsFormat}} uses a bit set to store documents that have
> a norm, and counts one bits using {{Long.bitCount}} in order to know the
> index of the current document in the set of docs that have a norm value.
> I think this is fairly good if a field is moderately sparse (somewhere
> between 5% and 99%), but it still has some issues: advancing by large
> deltas is slow (it still needs to visit all words in order to accumulate the
> number of ones to know the index of a document), and it does poorly when very
> few bits are set.
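
The index lookup described above can be sketched as follows. This is a minimal illustration, not the code in Lucene70NormsFormat: the class `BitSetRank`, the method `indexOf`, and the in-memory `long[]` layout are all assumptions made for the example. It shows why a large jump is costly: every word before the target has to be visited.

```java
// Illustrative sketch only: BitSetRank and indexOf are hypothetical names,
// and the bit set is assumed to be a plain in-memory long[].
final class BitSetRank {
    /**
     * Returns the number of set bits strictly before {@code doc}, which is the
     * index of {@code doc} among the documents that have a value.
     * Assumes the bit for {@code doc} is set.
     */
    static int indexOf(long[] words, int doc) {
        int wordIndex = doc >>> 6;
        int count = 0;
        // Every word before the target is visited: cost is linear in doc / 64.
        for (int i = 0; i < wordIndex; i++) {
            count += Long.bitCount(words[i]);
        }
        // Keep only the bits below `doc` inside the target word.
        long mask = (1L << (doc & 63)) - 1;
        return count + Long.bitCount(words[wordIndex] & mask);
    }
}
```

An iterator that moves forward one document at a time can keep a running count instead of rescanning, so the linear cost mostly shows up on large jumps.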
> I have been working on a disk-based adaptation of {{RoaringDocIdSet}} that
> would still give the ability to know the index of the current document. It
> seems to be only a bit slower than the current implementation on moderately
> sparse fields. However, it also comes with benefits:
> * it is faster in the sparse case when it uses the sparse encoding that uses
> shorts to store doc IDs (when the density is 6% or less)
> * it has faster advance() by large deltas (still linear, but by a factor of
> 65536 so that should always be fine in practice since doc IDs are bounded by 2B)
> * it uses O(numDocsWithField) storage rather than O(maxDoc), the worst case
> is 6 bytes per field, which occurs when each range of 65k docs contains
> exactly one document.
> * it is faster if some ranges of documents that share the same 16 upper bits
> are full; this is useful e.g. if there is a single document that misses a
> field in the whole index, or for use-cases that would store multiple types of
> documents (with different fields) within a single index and would use index
> sorting to put documents of the same type together
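
The block layout described in the list above can be sketched roughly as below. This is an in-memory illustration under stated assumptions, not the disk-based format in the patch: the class name `BlockedDocsWithField`, its fields, and the 4096-document cutoff for the sparse encoding (about 6% of a 65536-document block) are all hypothetical. Each block covers the doc IDs that share the same upper 16 bits and remembers how many documents with a value precede it, so a lookup skips whole blocks instead of individual words.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative in-memory sketch; all names here are hypothetical, not Lucene's.
final class BlockedDocsWithField {
    private static final int BLOCK_SIZE = 1 << 16;
    private static final int MAX_SPARSE = 4096; // assumed cutoff, about 6% of a block

    private final int[] blockIds;   // upper 16 bits of each non-empty block, ascending
    private final int[] firstIndex; // number of docs with a value in all earlier blocks
    private final Object[] payload; // short[] (sparse), long[] (dense) or null (full)

    /** @param sortedDocs strictly increasing doc IDs that have a value */
    BlockedDocsWithField(int[] sortedDocs) {
        List<Integer> ids = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        List<Object> data = new ArrayList<>();
        int i = 0;
        while (i < sortedDocs.length) {
            int block = sortedDocs[i] >>> 16;
            int end = i;
            while (end < sortedDocs.length && (sortedDocs[end] >>> 16) == block) end++;
            int cardinality = end - i;
            ids.add(block);
            starts.add(i);
            if (cardinality == BLOCK_SIZE) {
                data.add(null); // full block: nothing to store
            } else if (cardinality <= MAX_SPARSE) {
                short[] low = new short[cardinality]; // lower 16 bits of each doc ID
                for (int j = 0; j < cardinality; j++) low[j] = (short) sortedDocs[i + j];
                data.add(low);
            } else {
                long[] bits = new long[BLOCK_SIZE / 64];
                for (int j = i; j < end; j++) {
                    int lowBits = sortedDocs[j] & 0xFFFF;
                    bits[lowBits >>> 6] |= 1L << lowBits; // shift count is taken mod 64
                }
                data.add(bits);
            }
            i = end;
        }
        blockIds = ids.stream().mapToInt(Integer::intValue).toArray();
        firstIndex = starts.stream().mapToInt(Integer::intValue).toArray();
        payload = data.toArray();
    }

    /** Returns the index of doc among docs with a value, or -1 if it has none. */
    int indexOf(int doc) {
        int target = doc >>> 16;
        int b = 0;
        // Linear scan, but over blocks of 65536 docs rather than 64-bit words.
        while (b < blockIds.length && blockIds[b] < target) b++;
        if (b == blockIds.length || blockIds[b] != target) return -1;
        int low = doc & 0xFFFF;
        Object p = payload[b];
        if (p == null) return firstIndex[b] + low; // full block
        if (p instanceof short[]) {
            short[] s = (short[]) p;
            int lo = 0, hi = s.length - 1;
            while (lo <= hi) { // binary search on unsigned 16-bit values
                int mid = (lo + hi) >>> 1;
                int v = s[mid] & 0xFFFF;
                if (v < low) lo = mid + 1;
                else if (v > low) hi = mid - 1;
                else return firstIndex[b] + mid;
            }
            return -1;
        }
        long[] bits = (long[]) p;
        if ((bits[low >>> 6] & (1L << low)) == 0) return -1;
        int count = 0; // at most 1024 words are counted inside one dense block
        for (int w = 0; w < (low >>> 6); w++) count += Long.bitCount(bits[w]);
        count += Long.bitCount(bits[low >>> 6] & ((1L << low) - 1));
        return firstIndex[b] + count;
    }
}
```

For example, `new BlockedDocsWithField(new int[] {3, 10, 65543}).indexOf(65543)` finds the second block and returns its stored starting index, without counting any bits in the first block.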