[ https://issues.apache.org/jira/browse/LUCENE-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-7485:
---------------------------------
    Attachment: LUCENE-7485.patch

Here is a patch. I am experimenting with norms first since they have a smaller API, but the idea is to eventually use the same approach for doc values.

> Better storage for `docsWithField` in Lucene70NormsFormat
> ---------------------------------------------------------
>
>                 Key: LUCENE-7485
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7485
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7485.patch
>
>
> Currently {{Lucene70NormsFormat}} uses a bit set to store the documents that
> have a norm, and counts one bits using {{Long.bitCount}} in order to know the
> index of the current document in the set of docs that have a norm value.
> I think this is fairly good if a field is moderately sparse (density
> somewhere between 5% and 99%), but it still has some issues, such as slow
> advance by large deltas (it still needs to visit all words in order to
> accumulate the number of ones to know the index of a document) and
> inefficiency when very few bits are set.
> I have been working on a disk-based adaptation of {{RoaringDocIdSet}} that
> would still give the ability to know the index of the current document. It
> seems to be only a bit slower than the current implementation on moderately
> sparse fields. However, it also comes with benefits:
>  * it is faster in the sparse case, when it uses the sparse encoding that
> stores doc IDs as shorts (when the density is 6% or less)
>  * it has a faster advance() by large deltas (still linear, but with a
> constant smaller by a factor of 65536, so that should always be fine in
> practice since doc IDs are bounded by 2B)
>  * it uses O(numDocsWithField) storage rather than O(maxDoc); the worst case
> is 6 bytes per document that has the field, which occurs when each range of
> 65k docs contains exactly one document
>  * it is faster if some ranges of documents that share the same 16 upper bits
> are full. This is useful e.g. if there is a single document that misses a
> field in the whole index, or for use cases that store multiple types of
> documents (with different fields) within a single index and use index
> sorting to put documents of the same type together
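
To make the description above concrete, here is a minimal in-memory sketch of the idea. It is not the code from the attached patch: the class name `BlockedDocIndex`, the method `indexOf`, and the 4096-doc sparse threshold (borrowed from the usual roaring convention, and consistent with the "6% or less" figure) are all assumptions for illustration. Doc IDs are grouped by their 16 upper bits; each block is stored as a sorted array of 16-bit values when sparse, as a bit set counted with `Long.bitCount` when dense, and as nothing at all when full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory illustration only; the patch itself is disk-based. */
public class BlockedDocIndex {
    static final int BLOCK_SIZE = 1 << 16;
    static final int MAX_SPARSE = 4095; // assumed threshold, about 6% of a block

    private final int[] blockIds;    // 16 upper bits of each non-empty block, ascending
    private final int[] firstIndex;  // number of docs with a value before each block
    private final char[][] sparse;   // sorted 16 lower bits (char = unsigned 16-bit), or null
    private final long[][] dense;    // 1024-word bit set, or null; both null means "full"

    /** sortedDocs must contain distinct, non-negative doc IDs in ascending order. */
    public BlockedDocIndex(int[] sortedDocs) {
        List<int[]> ranges = new ArrayList<>(); // {blockId, from, to}
        int from = 0;
        while (from < sortedDocs.length) {
            int block = sortedDocs[from] >>> 16;
            int to = from;
            while (to < sortedDocs.length && (sortedDocs[to] >>> 16) == block) {
                to++;
            }
            ranges.add(new int[] {block, from, to});
            from = to;
        }
        int n = ranges.size();
        blockIds = new int[n];
        firstIndex = new int[n];
        sparse = new char[n][];
        dense = new long[n][];
        for (int i = 0; i < n; i++) {
            int[] r = ranges.get(i);
            blockIds[i] = r[0];
            firstIndex[i] = r[1];
            int count = r[2] - r[1];
            if (count == BLOCK_SIZE) {
                continue; // full block: no per-document data needed
            }
            if (count <= MAX_SPARSE) {
                sparse[i] = new char[count];
                for (int j = 0; j < count; j++) {
                    sparse[i][j] = (char) sortedDocs[r[1] + j]; // keeps the 16 lower bits
                }
            } else {
                dense[i] = new long[BLOCK_SIZE / 64];
                for (int j = r[1]; j < r[2]; j++) {
                    int low = sortedDocs[j] & 0xFFFF;
                    dense[i][low >>> 6] |= 1L << low; // long shifts use the low 6 bits
                }
            }
        }
    }

    /** Returns the index of doc among the docs that have a value, or -1 if it has none. */
    public int indexOf(int doc) {
        int b = Arrays.binarySearch(blockIds, doc >>> 16);
        if (b < 0) {
            return -1;
        }
        int low = doc & 0xFFFF;
        if (sparse[b] != null) {
            int pos = Arrays.binarySearch(sparse[b], (char) low);
            return pos < 0 ? -1 : firstIndex[b] + pos;
        }
        if (dense[b] == null) {
            return firstIndex[b] + low; // full block: the index is a simple offset
        }
        long[] words = dense[b];
        int word = low >>> 6;
        if ((words[word] & (1L << low)) == 0) {
            return -1;
        }
        int rank = 0;
        for (int w = 0; w < word; w++) {
            rank += Long.bitCount(words[w]); // ones in the preceding words of this block
        }
        rank += Long.bitCount(words[word] & ((1L << low) - 1)); // ones below this bit
        return firstIndex[b] + rank;
    }
}
```

In this sketch, skipping over a whole block costs one array lookup rather than a pass over 1024 words, which is the effect the advance() bullet describes. The per-block `firstIndex` array plays the role of whatever cumulative count the on-disk format would keep; how the patch actually lays this out on disk is not shown here.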