[jira] Commented: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

Doron Cohen (JIRA) Wed, 06 Dec 2006 13:20:51 -0800

    [ 
http://issues.apache.org/jira/browse/LUCENE-738?page=comments#action_12456216 ] 
            
Doron Cohen commented on LUCENE-738:
------------------------------------


I tried two implementations:
(1) writing d-gaps for ids of deleted docs, and 
(2) writing d-gaps for indexes of non zero bytes in the "byte bits []" array, 
plus the non-zero byte itself. 

I favor (2) because it needs less CPU and because its code is simpler. 

I ran the following performance test (Win XP, ~2GHz, 2GB RAM):
- create an index with 2,000,000 short docs.
- delete 4,000 docs in steps of 8: 0, 8, 16, 24, ...
("steps of 8" is worst case for the new code, because this way a single written 
byte never covers more than one deleted doc.)

Results for original code:

Operation......runCnt...recsPerRun........rec/s..elapsedSec
Deletions_4000......1........12000........232.1.......51.70
OpenReader.-..-..4000.-..-..-..-.1.-..-...150.0.-..-..26.67
DeleteDoc........4000............1......1,542.6........2.59
CloseReader.-....4000.-..-..-..-.1.-..-...180.0.-..-..22.23

Size of the .del file was 245KB.

Results for new code:

Operation......runCnt...recsPerRun........rec/s..elapsedSec
Deletions_4000......1........12000........263.8.......45.49
OpenReader.-..-..4000.-..-..-..-.1.-..-...173.3.-..-..23.08
DeleteDoc........4000............1......1,679.3........2.38
CloseReader.-....4000.-..-..-..-.1.-..-...201.4.-..-..19.86

Size of .del file was (at the end) 8KB. 

So the improvement in this somewhat artificial scenario was 13%. 
(I initially thought it would be higher).

I noticed when running repeatedly that the new code run time is more steady, 
while the original code run time fluctuates - e.g. 58.8, 51.7, 56.0, 51.8 
(reported above are the best times.)

We should ask: is this improvement sufficient to justify making BitVector code 
less simple?
Is a shorter .del file an advantage by itself?
I tend to "yes" for both questions, but would like to hear more opinions.

Change is backwards compatible - when opening a .del file (actually a 
bit-vector file) that was written with older version, it identifies that and 
reads it correctly. Existing backwards compatibility tests (of lock-less 
commits) have .del files that are opened correctly with new code.

I added a test case to BitVectorTest - it tests the switch from writing d-gaps 
to writing bits as the number of set bits increases or decreases.
This test also shows for what range of values d-gaps are written:
- size=100: never
- size=1,000: count < 6
- size=10,000: count < 42
- size=100,000: count < 417
- size=1,000,000: count < 3125

All tests pass.


> read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-738
>                 URL: http://issues.apache.org/jira/browse/LUCENE-738
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doron Cohen
>         Assigned To: Doron Cohen
>
> .del file of a segment maintains info on deleted documents in that segment. 
> The file exists only for segments having deleted docs, so it does not exists 
> for newly created segments (e.g. resulted from merge). Each time closing an 
> index reader that deleted any document, the .del file is rewritten. In fact, 
> since the lock-less commits change a new (generation of) .del file is created 
> in each such occasion.
> For small indexes there is no real problem with current situation. But for 
> very large indexes, each time such an index reader is closed, creating such 
> new bit-vector seems like unnecessary overhead in cases that the bit vector 
> is sparse (just a few docs were deleted). For instance, for an index with a 
> segment of 1M docs, the sequence: {open reader; delete 1 doc from that 
> segment; close reader;} would write a file of ~128KB. Repeat this sequence 8 
> times: 8 new files of total size of 1MB are written to disk.
> Whether this is a bottleneck or not depends on the application deletes 
> pattern, but for the case that deleted docs are sparse, writing just the 
> d-gaps would save space and time. 
> I have this (simple) change to BitVector running and currently trying some 
> performance tests to, yet, convince myself on the worthiness of this.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

Reply via email to