On Wed, Jan 07, 2009 at 09:28:40PM -0600, robert engels wrote:
> Why not just write the first byte as 0 for a bit set, and 1 for a
> sparse bit set (compressed), and make the determination when writing
> based on the segment size and/or number of set bits?

Are you offering that as a solution to the problem I described here?

> >When you make deletes with the BitSet model, you have to rewrite  
> >files that scale with segment size, regardless of how few deletions  
> >you make. Deletion of a single document in a large segment may  
> >necessitate writing out a substantial bit vector file.
> >
> >In contrast, i/o throughput for writing out a tombstone file scales  
> >with the number of tombstones.

Worst-case i/o costs don't improve under such a regime.  You could still end
up writing a large, uncompressed bit vector file to accommodate a single
deletion.
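
For concreteness, a write path along the lines Robert suggests might look
something like the sketch below.  (The class name, the format constants, and
the cutover heuristic are all made up for illustration; nothing here is an
existing Lucene API.)  The thing to notice is that the dense branch writes
maxDoc/8 bytes no matter how few bits are set, while the sparse branch's
output scales with the number of deletions:

    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.BitSet;

    class DeletionsWriter {
        static final byte FORMAT_BITSET = 0;  // dense: raw bit vector, one bit per doc
        static final byte FORMAT_SPARSE = 1;  // sparse: delta-coded deleted doc ids

        /** Write deletions in whichever form suits this segment's density. */
        void write(DataOutput out, BitSet deleted, int maxDoc) throws IOException {
            int numDeleted = deleted.cardinality();
            // Crude cutover: go dense once fixed-width 32-bit deltas would
            // outweigh the bit vector itself.  A real heuristic would weigh
            // segment size and/or number of set bits, per Robert's suggestion.
            boolean dense = (long) numDeleted * 32 > maxDoc;
            if (dense) {
                out.writeByte(FORMAT_BITSET);
                byte[] bits = new byte[(maxDoc + 7) / 8];   // always maxDoc/8 bytes
                for (int doc = deleted.nextSetBit(0); doc >= 0;
                     doc = deleted.nextSetBit(doc + 1)) {
                    bits[doc >> 3] |= 1 << (doc & 7);
                }
                out.write(bits, 0, bits.length);
            } else {
                out.writeByte(FORMAT_SPARSE);
                out.writeInt(numDeleted);
                int prev = 0;
                for (int doc = deleted.nextSetBit(0); doc >= 0;
                     doc = deleted.nextSetBit(doc + 1)) {
                    out.writeInt(doc - prev);               // scales with deletions
                    prev = doc;
                }
            }
        }
    }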

I suppose that index-time cost has to be weighed against the search-time cost
of interleaving the tombstone streams.  We can pay the interleaving penalty
either at index time or at search time.  It's annoying to write out a 1 MB
uncompressed bit vector file for a single deleted doc against an
8-million-doc segment, but if there are enough deletions to justify an
uncompressed file, iterating through them via merged-on-the-fly tombstone
streams would be annoying too.
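
To put a rough shape on that search-time cost, every deleted-docs check would
sit on top of something like the merge below: a priority queue over however
many tombstone streams the segment has accumulated, each an ascending list of
deleted doc ids.  (Again, just an illustrative sketch, not a proposed API.)

    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    /**
     * Merges several ascending streams of deleted doc ids into one
     * ascending stream, collapsing duplicates.
     */
    class MergedTombstones {
        /** One tombstone stream with a one-element lookahead. */
        private static final class Stream {
            final Iterator<Integer> docs;
            int current;
            Stream(Iterator<Integer> docs) {
                this.docs = docs;
                this.current = docs.next();
            }
            boolean advance() {
                if (!docs.hasNext()) return false;
                current = docs.next();
                return true;
            }
        }

        private final PriorityQueue<Stream> queue;

        MergedTombstones(List<Iterator<Integer>> streams) {
            queue = new PriorityQueue<Stream>(Math.max(1, streams.size()),
                new Comparator<Stream>() {
                    public int compare(Stream a, Stream b) {
                        return a.current - b.current;   // doc ids are non-negative
                    }
                });
            for (Iterator<Integer> s : streams) {
                if (s.hasNext()) queue.add(new Stream(s));
            }
        }

        /** Returns the next deleted doc id, or -1 when all streams are spent. */
        int next() {
            if (queue.isEmpty()) return -1;
            int doc = queue.peek().current;
            // Pop every stream sitting on this doc id; re-insert the ones
            // that still have tombstones left, so duplicates collapse.
            while (!queue.isEmpty() && queue.peek().current == doc) {
                Stream s = queue.poll();
                if (s.advance()) queue.add(s);
            }
            return doc;
        }
    }

Each call to next() costs roughly log(number of streams) in heap operations,
and that is the per-deletion interleaving tax we'd be signing every search up
for.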

Marvin Humphrey

