[
https://issues.apache.org/jira/browse/LUCENE-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-6840:
---------------------------------
Attachment: LUCENE-6840.patch
Here is a patch that should improve memory usage for:
- variable-length binary fields
- multi-valued sorted numeric fields
- multi-valued sorted set fields
On the other hand, the BINARY_PREFIX_COMPRESSED format still uses
MonotonicBlockPackedReader/Writer.
I wrote the patch by changing Lucene50DocValuesFormat to make it easier to
review, but when it's ready I plan to make it a whole new format (with new
Lucene54Codec, etc.).
Compared to previously, only per-block metadata is kept around in memory; the
data itself is written to disk using the DirectWriter/slice APIs. Out of
curiosity I tried writing all entries of my /usr/share/dict/words file into a
binary dv field to see how it compares to trunk:
{noformat}
trunk
.dvd: 992334 bytes
.dvm: 128 bytes
memory usage: 153124 bytes
patch
.dvd: 1038100 bytes
.dvm: 165 bytes
memory usage: 232 bytes
{noformat}
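The idea of keeping only per-block metadata on heap can be illustrated with a minimal, self-contained sketch. This is not the code from the patch: the class name `MonotonicBlockSketch` and its fields are hypothetical, a `ByteBuffer` stands in for the on-disk slice, and deltas are stored in whole bytes rather than bit-packed as DirectWriter would do. Each block keeps a base value and an average slope in memory, and only the small deviations from that line go to the "disk" buffer.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not the patch: monotonic values (e.g. offsets) are split
// into blocks; only a few numbers per block stay on heap.
final class MonotonicBlockSketch {
    static final int BLOCK_SHIFT = 16;              // 64K values per block
    static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;

    // Per-block metadata (the only heap-resident state).
    final long[] base;          // smallest value of (value - slope * i) in the block
    final float[] slope;        // average increase per entry in the block
    final int[] bytesPerValue;  // width of each stored delta, 0..8 bytes
    final int[] blockStart;     // offset of the block inside `data`
    final ByteBuffer data;      // stand-in for the on-disk data slice

    MonotonicBlockSketch(long[] values) {
        int numBlocks = (values.length + BLOCK_SIZE - 1) >>> BLOCK_SHIFT;
        base = new long[numBlocks];
        slope = new float[numBlocks];
        bytesPerValue = new int[numBlocks];
        blockStart = new int[numBlocks];
        data = ByteBuffer.allocate(values.length * Long.BYTES);
        for (int b = 0; b < numBlocks; b++) {
            int from = b << BLOCK_SHIFT;
            int len = Math.min(BLOCK_SIZE, values.length - from);
            float avg = len == 1 ? 0f
                : (float) (values[from + len - 1] - values[from]) / (len - 1);
            long minDelta = Long.MAX_VALUE, maxDelta = Long.MIN_VALUE;
            for (int i = 0; i < len; i++) {
                long delta = values[from + i] - values[from] - (long) (avg * i);
                minDelta = Math.min(minDelta, delta);
                maxDelta = Math.max(maxDelta, delta);
            }
            base[b] = values[from] + minDelta;   // makes all stored deltas >= 0
            slope[b] = avg;
            long range = maxDelta - minDelta;
            int nBytes = (64 - Long.numberOfLeadingZeros(range) + 7) / 8;
            bytesPerValue[b] = nBytes;
            blockStart[b] = data.position();
            for (int i = 0; i < len; i++) {
                long delta = values[from + i] - base[b] - (long) (avg * i);
                for (int k = 0; k < nBytes; k++) {
                    data.put((byte) (delta >>> (8 * k)));  // little-endian bytes
                }
            }
        }
    }

    // Random access: metadata lookup on heap, then one small read from `data`.
    long get(int index) {
        int b = index >>> BLOCK_SHIFT;
        int i = index & (BLOCK_SIZE - 1);
        int pos = blockStart[b] + i * bytesPerValue[b];
        long delta = 0;
        for (int k = 0; k < bytesPerValue[b]; k++) {
            delta |= (data.get(pos + k) & 0xFFL) << (8 * k);
        }
        return base[b] + (long) (slope[b] * i) + delta;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 5, 9, 16, 20, 27};  // e.g. start offsets of binary values
        MonotonicBlockSketch sketch = new MonotonicBlockSketch(offsets);
        for (int i = 0; i < offsets.length; i++) {
            System.out.println(i + " -> " + sketch.get(i));
        }
    }
}
```

In this sketch the heap cost is a handful of primitives per block regardless of how many values the block holds, which is the property the memory figures above reflect.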
One important thing is that I had to introduce some per-thread memory usage:
each thread needs to have its own array of DirectReader instances (one per
block). This is why I raised the block size from 16K to 64K in order to have
fewer blocks. It might need to be increased even further (although this would
also hurt compression a bit). In the worst case, where someone has a segment with
2B documents, there would be 32K blocks of 64K values, so each thread would need
about 1.2MB of memory. In my opinion this is OK since apps should query their
Lucene indices from a reasonable number of threads, and it would probably still
be much better than today since even requiring a single bit of memory per
document (with today's MonotonicBlockPackedReader) would use 256MB of memory.
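The worst-case figures above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: the class and method names are hypothetical, 2B documents is approximated as 2^31, and the 38 bytes per DirectReader instance is an assumption back-computed from the 1.2MB figure, not a measured value.

```java
// Hypothetical helper for the worst-case estimates quoted above.
final class BlockMemoryEstimate {
    // Number of blocks needed for numDocs values with 2^blockShift values per block.
    static long numBlocks(long numDocs, int blockShift) {
        return (numDocs + (1L << blockShift) - 1) >>> blockShift;
    }

    // Per-thread cost if every block needs one reader of the given size.
    static long perThreadBytes(long blocks, long bytesPerReader) {
        return blocks * bytesPerReader;
    }

    // Heap cost of a structure that needs one bit per document.
    static long bitPerDocBytes(long numDocs) {
        return (numDocs + 7) / 8;
    }

    public static void main(String[] args) {
        long numDocs = 1L << 31;                 // roughly 2B documents
        long blocks64k = numBlocks(numDocs, 16); // 64K values per block
        long blocks16k = numBlocks(numDocs, 14); // 16K values per block
        System.out.println("blocks at 64K: " + blocks64k);
        System.out.println("blocks at 16K: " + blocks16k);
        // 38 bytes per reader is an assumed value, see the note above.
        System.out.println("per-thread bytes: " + perThreadBytes(blocks64k, 38));
        System.out.println("one bit per doc, bytes: " + bitPerDocBytes(numDocs));
    }
}
```

With these inputs, going from 16K to 64K values per block cuts the number of per-thread readers by a factor of four, while one bit per document on heap already amounts to 2^28 bytes.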
> Put ord indexes of doc values on disk
> -------------------------------------
>
> Key: LUCENE-6840
> URL: https://issues.apache.org/jira/browse/LUCENE-6840
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-6840.patch
>
>
> Currently we still load monotonic blocks into memory to map doc ids to an
> offset on disk. Since these data structures are usually consumed sequentially
> I would like to investigate putting them on disk.