Rhinodog looks neat! Impressively small sources :) Merging postings per term would be hard for Lucene, because it's write-once, i.e. once a segment's terms and postings are written, we cannot update them in place. Instead, we merge N segments together into a new, larger (also write-once) segment.
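
To make that concrete, here's a rough sketch of what a segment merge
boils down to (not actual Lucene code; Segment and the doc-ID remapping
below are simplified stand-ins):

    object SegmentMergeSketch {
      // A write-once segment: term -> sorted doc IDs, plus tombstones
      // for deleted docs. Merging never touches the inputs; it builds
      // a brand-new segment and leaves the old ones intact.
      case class Segment(postings: Map[String, Vector[Int]], deleted: Set[Int])

      def merge(segments: Seq[Segment]): Segment = {
        // Assign every surviving doc a new contiguous ID, the way a
        // merge remaps doc IDs into the new segment's doc-ID space.
        var next = 0
        val remaps: Seq[Map[Int, Int]] = segments.map { seg =>
          val live = seg.postings.values.flatten.toSet
            .diff(seg.deleted).toSeq.sorted
          live.map { old => val id = next; next += 1; old -> id }.toMap
        }
        // Walk the union of terms and splice the live postings together.
        val allTerms = segments.flatMap(_.postings.keys).distinct.sorted
        val merged = for {
          term <- allTerms
          docs = segments.zip(remaps).flatMap { case (seg, remap) =>
            seg.postings.getOrElse(term, Vector.empty)
              .collect { case d if !seg.deleted(d) => remap(d) }
          }.sorted.toVector
          if docs.nonEmpty
        } yield term -> docs
        Segment(merged.toMap, Set.empty)
      }
    }

The point is that merge never mutates its inputs: the old segments keep
serving searches until the new one is swapped in, and the writes are big
and sequential, which is part of why write-once works well on HDDs.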
Whereas it looks like Rhinodog's term dictionary uses on-disk data structures (btree?) that can be updated in place?

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jul 8, 2016 at 10:13 AM, Konstantin <[email protected]> wrote:

> Hello, my name is Konstantin. I'm currently reading Lucene's sources and
> wondering why particular technical decisions were made.
>
> Full disclosure - I'm writing my own inverted index implementation as a
> pet project: https://github.com/kk00ss/Rhinodog . It's about 4 kloc of
> Scala, and there are tests comparing it with Lucene on a wiki dump (I
> actually run them only on a small part of it, ~500MB).
>
> Most interesting to me is why the compaction algorithm is implemented
> this way - it's clear and simple, but wouldn't it be better to merge
> posting lists on a per-term basis? The current Lucene implementation is
> probably better for HDDs, and the proposed one would need an SSD to show
> adequate performance. But that would mean more, smaller compactions,
> each much cheaper. Sometimes, if a term has a small posting list, it
> would be inefficient, but I think some threshold could be used.
>
> This idea comes from the assumption that when half of the documents have
> been removed from a segment, not all the terms might need compaction,
> assuming a non-uniform distribution of terms among documents (which
> seems likely to me, an amateur ;-) ).
>
> Does it make any sense?
>
> BTW, any input about Rhinodog and its benchmarks vs Lucene would be
> appreciated.
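
For what it's worth, the per-term compaction described above might look
roughly like this (a sketch only, assuming an in-place-updatable term
dictionary like Rhinodog's; TermDict, the deleted-fraction check, and
the 0.5 threshold are all invented for illustration):

    import scala.collection.mutable

    object PerTermCompactionSketch {
      // Invented stand-in for an on-disk btree that can be rewritten
      // one term at a time, unlike a write-once segment.
      final class TermDict {
        val postings = mutable.Map.empty[String, Vector[Int]]
        def put(term: String, docs: Vector[Int]): Unit = postings(term) = docs
      }

      // Rewrite only terms whose posting lists are "dirty enough" and
      // big enough to be worth the I/O - the threshold idea from the mail.
      def compact(dict: TermDict, deleted: Set[Int],
                  dirtyThreshold: Double = 0.5, minSize: Int = 128): Int = {
        var compacted = 0
        for ((term, docs) <- dict.postings.toSeq if docs.length >= minSize) {
          val dead = docs.count(deleted)
          if (dead.toDouble / docs.length >= dirtyThreshold) {
            dict.put(term, docs.filterNot(deleted)) // rewrite just this term
            compacted += 1
          }
        }
        compacted
      }
    }

Only the dirty terms get rewritten, so when deletes cluster in a small
subset of terms (the non-uniform distribution assumed above), each
compaction is much smaller and cheaper than a full segment merge - at
the cost of many small random writes, hence the SSD caveat.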
