[ 
https://issues.apache.org/jira/browse/LUCENE-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960514#comment-14960514
 ] 

Adrien Grand commented on LUCENE-6840:
--------------------------------------

I did a quick benchmark with the geonames dataset (10740477 documents), with 
only two doc values fields:
 - name as a binary dv field
 - alternatenames as a sorted set dv field

Then I measured disk/memory usage and tried to sort on both fields (using 
SortedSetSelector.Type.MIN for the multi-valued field) with a MatchAllDocsQuery:

||branch||index size (MB)||memory usage (MB)||time to sort on "name" (ms)||time to sort on "alternatenames" (ms)||
|trunk|311|33.1|4750|3570|
|patch|319 (+2%)|1.4 (-96%)|5390 (+13%)|4000 (+12%)|

Maybe there are things we can optimize with the patch, but even with these 
numbers I think the patch offers a better trade-off: sorting is 12-13% slower, 
but memory usage drops by 96%. I am not very happy that the current format 
takes more than 3 bytes of memory per document for only two doc values fields.
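
The per-document memory figure can be checked from the numbers in the table. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes that "MB" in the table means 2^20 bytes (with 10^6 bytes the trunk figure is still above 3 bytes per document).

```java
// Hypothetical helper, not part of Lucene: re-derives the figures quoted in the table.
final class DocValuesMemory {
    // Assumption: the table's "MB" means 2^20 bytes.
    static final double BYTES_PER_MB = 1024.0 * 1024.0;

    // Average heap bytes used per document.
    static double bytesPerDoc(double megabytes, long docCount) {
        return megabytes * BYTES_PER_MB / docCount;
    }

    // Relative change from "before" to "after", in percent.
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        long docs = 10_740_477L; // geonames document count from the benchmark

        System.out.printf("trunk: %.2f bytes/doc%n", bytesPerDoc(33.1, docs));
        System.out.printf("patch: %.2f bytes/doc%n", bytesPerDoc(1.4, docs));
        System.out.printf("memory change: %.1f%%%n", percentChange(33.1, 1.4));
        System.out.printf("sort on name: %.1f%%%n", percentChange(4750, 5390));
        System.out.printf("sort on alternatenames: %.1f%%%n", percentChange(3570, 4000));
    }
}
```

Under that assumption, trunk comes out slightly above 3 bytes per document and the patch well below 1 byte per document.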

> Put ord indexes of doc values on disk
> -------------------------------------
>
>                 Key: LUCENE-6840
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6840
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6840.patch
>
>
> Currently we still load monotonic blocks into memory to map doc ids to an 
> offset on disk. Since these data structures are usually consumed 
> sequentially, I would like to investigate moving them to disk.



