Hello, my name is Konstantin. I'm currently reading Lucene's sources and wondering why particular technical decisions were made.
Full disclosure - I'm writing my own inverted index implementation as a pet project: https://github.com/kk00ss/Rhinodog . It's about 4 kLOC of Scala, and there are tests comparing it with Lucene on a Wikipedia dump (I actually run them only on a small part of it, ~500 MB).

What interests me most is why the compaction algorithm is implemented the way it is. It's clear and simple, but wouldn't it be better to merge postings lists on a per-term basis? The current Lucene implementation is probably better suited for HDDs, and the proposed approach would need an SSD to show adequate performance. But it would mean more, smaller compactions, each much cheaper. Sometimes, when a term has a small postings list, a per-term merge would be inefficient, but I think a threshold could be used to skip such terms. This idea comes from the assumption that when half of the documents have been removed from a segment, not all terms necessarily need compaction, given a non-uniform distribution of terms among documents (which seems likely to me, an amateur ;-) ). Does this make any sense? BTW, any input about Rhinodog and its benchmarks vs. Lucene would be appreciated.
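
To make the threshold idea concrete, here is a rough sketch in Scala (since Rhinodog is Scala) of what a per-term compaction decision could look like. Everything here is hypothetical - the types, names, and threshold values are not Rhinodog's or Lucene's actual code, just an illustration of the heuristic:

// Rough sketch of the per-term compaction idea; all names and numbers are made up.
case class Posting(docId: Long)

case class TermPostings(term: String,
                        live: Vector[Posting],     // postings for live documents
                        deleted: Vector[Posting])  // postings for deleted documents

object PerTermCompaction {
  // Skip terms whose postings lists are too small to be worth rewriting,
  // and terms with too few deletions for a rewrite to pay off.
  val minPostingsToCompact = 128
  val minDeletedFraction   = 0.3

  def needsCompaction(tp: TermPostings): Boolean = {
    val total = tp.live.size + tp.deleted.size
    total >= minPostingsToCompact &&
      tp.deleted.size.toDouble / total >= minDeletedFraction
  }

  // Compact only the terms that pass the threshold; the rest keep their
  // existing on-disk postings untouched, which is where the savings come from.
  def compact(terms: Seq[TermPostings]): Seq[TermPostings] =
    terms.map { tp =>
      if (needsCompaction(tp)) tp.copy(deleted = Vector.empty) // rewrite without deleted docs
      else tp
    }
}

The trade-off, as mentioned above, is that touching many individual postings lists instead of streaming whole segments turns one large sequential pass into many smaller reads and writes, which is why I suspect it only pays off on SSDs.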
