[jira] [Updated] (LUCENE-7485) Better storage for `docsWithField` in Lucene70NormsFormat

Adrien Grand (JIRA) Mon, 10 Oct 2016 07:14:35 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Adrien Grand updated LUCENE-7485:
---------------------------------
    Attachment: LUCENE-7485.patch

Here is a patch. I am experimenting with norms first since they have a smaller 
API, but the idea is to eventually use the same approach for doc values.

> Better storage for `docsWithField` in Lucene70NormsFormat
> ---------------------------------------------------------
>
>                 Key: LUCENE-7485
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7485
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7485.patch
>
>
> Currently {{Lucene70NormsFormat}} uses a bit set to store documents that have 
> a norm, and counts one bits using {{Long.bitCount}} in order to know the 
> index of the current document in the set of docs that have a norm value.
> I think this is fairly good if a field is moderately sparse (somewhere 
> between 5% and 99%), but it still has some issues, such as slow advance by 
> large deltas (it still needs to visit all words in order to accumulate the 
> number of ones to know the index of a document) and inefficiency when very 
> few bits are set.
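For illustration, the bit-set lookup described above can be sketched as follows. This is a minimal in-memory sketch, not the code in {{Lucene70NormsFormat}}: the class name `BitSetRank` and the method `rank` are hypothetical, and the bit set is assumed to be a plain `long[]`. It shows why advancing by a large delta is costly: every word before the target has to be visited.

```java
final class BitSetRank {
    /**
     * Returns the number of set bits strictly before docId, which is the
     * index of docId among the documents that have a value.
     * (Hypothetical helper; assumes docId is within the bit set.)
     */
    static int rank(long[] words, int docId) {
        int wordIndex = docId >>> 6;
        int count = 0;
        // Every earlier word must be visited: cost is linear in docId / 64.
        for (int i = 0; i < wordIndex; i++) {
            count += Long.bitCount(words[i]);
        }
        // Count only the bits below docId inside its own word.
        long below = (1L << (docId & 63)) - 1;
        return count + Long.bitCount(words[wordIndex] & below);
    }

    public static void main(String[] args) {
        long[] words = new long[3];
        words[0] = 0b1011L;      // docs 0, 1 and 3 have a value
        words[1] = 1L;           // doc 64 has a value
        words[2] = 1L << 2;      // doc 130 has a value
        System.out.println(rank(words, 130));
    }
}
```

The loop over `words` is the part that a block-based layout avoids, since per-block counts can be stored instead of being recomputed.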
> I have been working on a disk-based adaptation of {{RoaringDocIdSet}} that 
> would still give the ability to know the index of the current document. It 
> seems to be only a bit slower than the current implementation on moderately 
> sparse fields. However, it also comes with benefits:
>  * it is faster in the sparse case when it uses the sparse encoding that uses 
> shorts to store doc IDs (when the density is 6% or less)
>  * it has faster advance() by large deltas (still linear, but reduced by a 
> factor of 65536, so that should always be fine in practice since doc IDs are 
> bounded by 2B)
>  * it uses O(numDocsWithField) storage rather than O(maxDoc), the worst case 
> is 6 bytes per field, which occurs when each range of 65k docs contains 
> exactly one document.
>  * it is faster if some ranges of documents that share the same 16 upper bits 
> are full. This is useful e.g. if there is a single document that misses a 
> field in the whole index, or for use-cases that store multiple types of 
> documents (with different fields) within a single index and use index 
> sorting to put documents of the same type together
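
The block layout described in the list above can be sketched in memory as follows. This is only an illustration under stated assumptions, not the on-disk format of the attached patch: the class `RoaringRankSketch`, its fields, and the `MAX_SPARSE` threshold of 4095 are hypothetical. The threshold is chosen because 4096 two-byte entries occupy 8192 bytes, the same as a 65536-bit bit set, which is consistent with the "6% or less" figure. Each block records how many documents precede it, so the index of a document is known without counting bits in earlier blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RoaringRankSketch {
    static final int BLOCK_SIZE = 1 << 16;  // docs sharing the same 16 upper bits
    static final int MAX_SPARSE = 4095;     // assumed sparse/dense threshold

    private static final class Block {
        final int id;          // upper 16 bits of the doc IDs in this block
        final int firstIndex;  // number of docs with a value in earlier blocks
        final char[] sparse;   // sorted lower 16 bits, or null
        final long[] dense;    // 1024-word bit set, or null
        // sparse == null && dense == null means the block is full

        Block(int id, int firstIndex, char[] sparse, long[] dense) {
            this.id = id;
            this.firstIndex = firstIndex;
            this.sparse = sparse;
            this.dense = dense;
        }
    }

    private final List<Block> blocks = new ArrayList<>();

    /** Expects distinct, non-negative doc IDs in ascending order. */
    RoaringRankSketch(int[] sortedDocIds) {
        int start = 0;
        int index = 0;
        while (start < sortedDocIds.length) {
            int id = sortedDocIds[start] >>> 16;
            int end = start;
            while (end < sortedDocIds.length && (sortedDocIds[end] >>> 16) == id) {
                end++;
            }
            int cardinality = end - start;
            char[] sparse = null;
            long[] dense = null;
            if (cardinality <= MAX_SPARSE) {
                sparse = new char[cardinality];
                for (int i = 0; i < cardinality; i++) {
                    sparse[i] = (char) sortedDocIds[start + i]; // keeps lower 16 bits
                }
            } else if (cardinality < BLOCK_SIZE) {
                dense = new long[BLOCK_SIZE / 64];
                for (int i = start; i < end; i++) {
                    int low = sortedDocIds[i] & 0xFFFF;
                    dense[low >>> 6] |= 1L << low; // shift distance is taken mod 64
                }
            }
            blocks.add(new Block(id, index, sparse, dense));
            index += cardinality;
            start = end;
        }
    }

    /** Returns the index of docId among docs with a value, or -1 if it has none. */
    int indexOf(int docId) {
        int id = docId >>> 16;
        int low = docId & 0xFFFF;
        for (Block b : blocks) {            // linear in blocks, not in words
            if (b.id < id) continue;
            if (b.id > id) return -1;
            if (b.sparse != null) {
                int pos = Arrays.binarySearch(b.sparse, (char) low);
                return pos >= 0 ? b.firstIndex + pos : -1;
            }
            if (b.dense == null) {
                return b.firstIndex + low;  // full block: no lookup needed
            }
            if ((b.dense[low >>> 6] & (1L << low)) == 0) return -1;
            int rank = 0;
            for (int w = 0; w < (low >>> 6); w++) {
                rank += Long.bitCount(b.dense[w]);
            }
            rank += Long.bitCount(b.dense[low >>> 6] & ((1L << low) - 1));
            return b.firstIndex + rank;
        }
        return -1;
    }
}
```

For example, `new RoaringRankSketch(new int[] {5, 70000, 70001, 200000}).indexOf(200000)` walks three small blocks rather than scanning every word of a 200000-bit set. A real disk-based version would read block headers from the index input instead of holding arrays in memory.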



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

