[jira] [Updated] (LUCENE-6840) Put ord indexes of doc values on disk

Adrien Grand (JIRA) Thu, 15 Oct 2015 07:01:06 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adrien Grand updated LUCENE-6840:
---------------------------------
    Attachment: LUCENE-6840.patch

Here is a patch that should improve memory usage for:
 - variable-length binary fields
 - multi-valued sorted numeric fields
 - multi-valued sorted set fields

On the other hand, the BINARY_PREFIX_COMPRESSED format still uses 
MonotonicBlockPackedReader/Writer.

I wrote the patch by changing Lucene50DocValuesFormat to make it easier to 
review, but when it's ready I plan to make it a whole new format (with new 
Lucene54Codec, etc.).

Compared to trunk, only per-block metadata is kept in memory; the data itself 
is written to disk using the DirectWriter/slice APIs. Out of curiosity, I wrote 
all entries of my /usr/share/dict/words file into a binary dv field to see how 
the patch compares to trunk:

{noformat}
trunk
  .dvd:         992334 bytes
  .dvm:         128 bytes
  memory usage: 153124 bytes
patch
  .dvd:         1038100 bytes
  .dvm:         165 bytes
  memory usage: 232 bytes
{noformat}
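
To illustrate the split between in-memory per-block metadata and on-disk data 
described above, here is a minimal, self-contained sketch. It is not code from 
the patch: the class name MonotonicBlocksSketch and its fields are 
hypothetical, a ByteBuffer stands in for the file slice, and deltas are stored 
as plain 8-byte longs rather than bit-packed the way DirectWriter would store 
them.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of monotonic block encoding; not the patch's code. */
public class MonotonicBlocksSketch {
    private final int blockSize;
    // Per-block metadata: the only state meant to stay in memory.
    private final long[] blockMin;
    private final float[] blockAvg;
    // Stand-in for the on-disk data (assumption: one 8-byte long per value,
    // where a real format would bit-pack the deltas).
    private final ByteBuffer deltas;

    public MonotonicBlocksSketch(long[] values, int blockSize) {
        this.blockSize = blockSize;
        int numBlocks = (values.length + blockSize - 1) / blockSize;
        blockMin = new long[numBlocks];
        blockAvg = new float[numBlocks];
        deltas = ByteBuffer.allocate(values.length * Long.BYTES);
        for (int b = 0; b < numBlocks; b++) {
            int start = b * blockSize;
            int end = Math.min(start + blockSize, values.length);
            int len = end - start;
            // Average slope of the block, used to predict each value.
            float avg = len == 1
                ? 0f
                : (float) (values[end - 1] - values[start]) / (len - 1);
            blockMin[b] = values[start];
            blockAvg[b] = avg;
            for (int i = start; i < end; i++) {
                long expected = values[start] + (long) (avg * (i - start));
                // Only the (usually small) deviation from the prediction is stored.
                deltas.putLong(i * Long.BYTES, values[i] - expected);
            }
        }
    }

    /** Rebuilds a value from in-memory metadata plus one read of the data. */
    public long get(int index) {
        int b = index / blockSize;
        int i = index % blockSize;
        long expected = blockMin[b] + (long) (blockAvg[b] * i);
        return expected + deltas.getLong(index * Long.BYTES);
    }

    public static void main(String[] args) {
        long[] offsets = {0, 5, 9, 14, 20, 26, 31};
        MonotonicBlocksSketch sketch = new MonotonicBlocksSketch(offsets, 4);
        for (int i = 0; i < offsets.length; i++) {
            System.out.println(i + " -> " + sketch.get(i));
        }
    }
}
```

In this sketch the in-memory footprint grows with the number of blocks rather 
than with the number of values, which is the property the numbers above 
reflect.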

One important thing is that I had to introduce some per-thread memory usage: 
each thread needs its own array of DirectReader instances (one per block). This 
is why I raised the block size from 16K to 64K, in order to have fewer blocks. 
It might need to be raised further (though that would also hurt compression a 
bit). In the worst case, where someone has a segment with 2B documents, there 
would be 32K blocks of 64K values, so each thread would need about 1.2MB of 
memory. In my opinion this is OK, since apps should query their Lucene indices 
from a reasonable number of threads, and it would probably still be much better 
than today: even a single bit of memory per document (with today's 
MonotonicBlockPackedReader) would use 256MB of memory for such a segment.
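
The worst-case arithmetic above can be checked with a short sketch. The class 
and method names are hypothetical, "2B documents" is approximated as 2^31, and 
the per-block figure is only back-derived from the 1.2MB estimate, not 
measured.

```java
/** Hypothetical helper that re-derives the worst-case numbers quoted above. */
public class PerThreadMemoryEstimate {
    /** Number of blocks needed to cover numDocs values (rounded up). */
    static long blockCount(long numDocs, long blockSize) {
        return (numDocs + blockSize - 1) / blockSize;
    }

    /** Bytes needed to hold one bit per document. */
    static long oneBitPerDocBytes(long numDocs) {
        return numDocs / 8;
    }

    public static void main(String[] args) {
        long numDocs = 1L << 31;   // assumption: "2B documents" taken as 2^31
        long blockSize = 1L << 16; // 64K values per block

        long blocks = blockCount(numDocs, blockSize);
        System.out.println("blocks: " + blocks); // 32768, i.e. 32K

        // Back-derived from the ~1.2MB per-thread estimate; not a measured size.
        double perThreadBytes = 1.2 * 1024 * 1024;
        System.out.println("implied bytes per block: " + perThreadBytes / blocks);

        long bitsetBytes = oneBitPerDocBytes(numDocs);
        System.out.println("one bit per doc: " + bitsetBytes / (1024 * 1024) + "MB"); // 256MB
    }
}
```

With these inputs, the per-thread estimate works out to a few dozen bytes per 
block, against 256MB for a one-bit-per-document structure shared by all 
threads.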

> Put ord indexes of doc values on disk
> -------------------------------------
>
>                 Key: LUCENE-6840
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6840
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6840.patch
>
>
> Currently we still load monotonic blocks into memory to map doc ids to an 
> offset on disk. Since these data structures are usually consumed sequentially 
> I would like to investigate putting them to disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

