read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
-----------------------------------------------------------------------------
Key: LUCENE-738
URL: http://issues.apache.org/jira/browse/LUCENE-738
Project: Lucene - Java
Issue Type: Improvement
Components: Store
Affects Versions: 2.1
Reporter: Doron Cohen
Assigned To: Doron Cohen
The .del file of a segment maintains info on the deleted documents in that
segment. The file exists only for segments that have deleted docs, so it does
not exist for newly created segments (e.g. those resulting from a merge). Each
time an index reader that deleted any document is closed, the .del file is
rewritten. In fact, since the lock-less commits change, a new (generation of)
.del file is created on each such occasion.
For small indexes there is no real problem with the current situation. But for
very large indexes, each time such an index reader is closed, writing a whole
new bit vector seems like unnecessary overhead when the bit vector is sparse
(just a few docs were deleted). For instance, for an index with a segment of
1M docs, the sequence: {open reader; delete 1 doc from that segment; close
reader;} would write a file of ~128KB (one bit per doc). Repeat this sequence
8 times: 8 new files with a total size of ~1MB are written to disk.
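The arithmetic behind these figures can be sketched as follows. This is only an
illustration: the class and method names are hypothetical, "1M docs" is taken
to mean 2^20 docs, and any file header bytes are ignored.

```java
// Hypothetical helper, not part of Lucene: estimates the size of a plain
// (non-sparse) deleted-docs bit vector, ignoring any file header.
final class DelFileSizeSketch {

    // One bit per document, rounded up to whole bytes.
    static long bitVectorBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    // Each commit writes a full new generation of the file.
    static long totalBytes(long maxDoc, int commits) {
        return bitVectorBytes(maxDoc) * commits;
    }
}
```

With these assumptions, bitVectorBytes(1L << 20) is 131072 bytes (128KB) and
totalBytes(1L << 20, 8) is 1048576 bytes (1MB). For exactly 1,000,000 docs the
vector would be 125000 bytes.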
Whether this is a bottleneck or not depends on the application's delete
pattern, but for the case where deleted docs are sparse, writing just the
d-gaps would save space and time.
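To illustrate the general idea of d-gaps, here is a minimal sketch that stores
the number of deleted docs followed by the difference between each deleted doc
id and the previous one, each as a variable-length int (7 data bits per byte,
high bit set when more bytes follow). The class name, the use of
java.util.BitSet and the exact byte layout are assumptions made for the
example; the actual BitVector change may well use a different layout.

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;

// Hypothetical sketch of d-gap encoding for a sparse set of deleted docs.
// Not the on-disk format of any Lucene release.
final class DGapsSketch {

    // Variable-length int: low 7 bits first, high bit marks continuation.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads one variable-length int; pos[0] is the cursor and is advanced.
    static int readVInt(byte[] data, int[] pos) {
        byte b = data[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Writes the count of deleted docs, then the gap to each deleted doc.
    static byte[] encode(BitSet deleted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, deleted.cardinality());
        int last = 0;
        for (int doc = deleted.nextSetBit(0); doc >= 0;
                 doc = deleted.nextSetBit(doc + 1)) {
            writeVInt(out, doc - last);
            last = doc;
        }
        return out.toByteArray();
    }

    // Rebuilds the set of deleted docs by summing the gaps.
    static BitSet decode(byte[] data) {
        int[] pos = {0};
        int count = readVInt(data, pos);
        BitSet deleted = new BitSet();
        int doc = 0;
        for (int i = 0; i < count; i++) {
            doc += readVInt(data, pos);
            deleted.set(doc);
        }
        return deleted;
    }
}
```

In this sketch a single deleted doc with id 500000 encodes to 4 bytes (1 for
the count, 3 for the gap), compared with the full bit vector of the example
above. The encoding only pays off while the vector is sparse, so a writer
would still need to fall back to the plain bit vector once many docs are
deleted.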
I have this (simple) change to BitVector running and am currently trying some
performance tests to convince myself that it is worthwhile.