[
http://issues.apache.org/jira/browse/LUCENE-738?page=comments#action_12456216 ]
Doron Cohen commented on LUCENE-738:
------------------------------------
I tried two implementations:
(1) writing d-gaps for ids of deleted docs, and
(2) writing d-gaps for indexes of non zero bytes in the "byte bits []" array,
plus the non-zero byte itself.
I favor (2) because it needs less CPU and because its code is simpler.
I ran the following performance test (Win XP, ~2GHz, 2GB RAM):
- create an index with 2,000,000 short docs.
- delete 4,000 docs in steps of 8: 0, 8, 16, 24, ...
("steps of 8" is worst case for the new code, because this way a single written
byte never covers more than one deleted doc.)
Results for original code:
Operation......runCnt...recsPerRun........rec/s..elapsedSec
Deletions_4000......1........12000........232.1.......51.70
OpenReader.-..-..4000.-..-..-..-.1.-..-...150.0.-..-..26.67
DeleteDoc........4000............1......1,542.6........2.59
CloseReader.-....4000.-..-..-..-.1.-..-...180.0.-..-..22.23
Size of the .del file was 245KB.
Results for new code:
Operation......runCnt...recsPerRun........rec/s..elapsedSec
Deletions_4000......1........12000........263.8.......45.49
OpenReader.-..-..4000.-..-..-..-.1.-..-...173.3.-..-..23.08
DeleteDoc........4000............1......1,679.3........2.38
CloseReader.-....4000.-..-..-..-.1.-..-...201.4.-..-..19.86
Size of .del file was (at the end) 8KB.
So the improvement in this somewhat artificial scenario was 13%.
(I initially thought it would be higher).
I noticed when running repeatedly that the new code run time is more steady,
while the original code run time fluctuates - e.g. 58.8, 51.7, 56.0, 51.8
(reported above are the best times.)
We should ask: is this improvement sufficient to justify making BitVector code
less simple?
Is a shorter .del file an advantage by itself?
I tend to "yes" for both questions, but would like to hear more opinions.
Change is backwards compatible - when opening a .del file (actually a
bit-vector file) that was written with older version, it identifies that and
reads it correctly. Existing backwards compatibility tests (of lock-less
commits) have .del files that are opened correctly with new code.
I added a test case to BitVectorTest - it tests the switch from writing d-gaps
to writing bits as the number of set bits increases or decreases.
This test also shows for what range of values d-gaps are written:
- size=100: never
- size=1,000: count < 6
- size=10,000: count < 42
- size=100,000: count < 417
- size=1,000,000: count < 3125
All tests pass.
> read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
> ----------------------------------------------------------------------------
>
> Key: LUCENE-738
> URL: http://issues.apache.org/jira/browse/LUCENE-738
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 2.1
> Reporter: Doron Cohen
> Assigned To: Doron Cohen
>
> .del file of a segment maintains info on deleted documents in that segment.
> The file exists only for segments having deleted docs, so it does not exists
> for newly created segments (e.g. resulted from merge). Each time closing an
> index reader that deleted any document, the .del file is rewritten. In fact,
> since the lock-less commits change a new (generation of) .del file is created
> in each such occasion.
> For small indexes there is no real problem with current situation. But for
> very large indexes, each time such an index reader is closed, creating such
> new bit-vector seems like unnecessary overhead in cases that the bit vector
> is sparse (just a few docs were deleted). For instance, for an index with a
> segment of 1M docs, the sequence: {open reader; delete 1 doc from that
> segment; close reader;} would write a file of ~128KB. Repeat this sequence 8
> times: 8 new files of total size of 1MB are written to disk.
> Whether this is a bottleneck or not depends on the application deletes
> pattern, but for the case that deleted docs are sparse, writing just the
> d-gaps would save space and time.
> I have this (simple) change to BitVector running and currently trying some
> performance tests to, yet, convince myself on the worthiness of this.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]