Rhinodog looks neat! Impressively small sources :) Merging postings per term would be hard for Lucene, because it's write-once, i.e. once a segment's terms and postings are written, we cannot update them in place. Instead, we merge N segments together into a new, larger (also write-once) segment.
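
To make that concrete, here's a rough sketch of what a segment merge
boils down to (not actual Lucene code; Segment and the doc-ID remapping
below are simplified stand-ins):

    object SegmentMergeSketch {
      // A write-once segment: term -> sorted doc IDs, plus tombstones
      // for deleted docs. Merging never touches the inputs; it builds
      // a brand-new segment and leaves the old ones intact.
      case class Segment(postings: Map[String, Vector[Int]], deleted: Set[Int])

      def merge(segments: Seq[Segment]): Segment = {
        // Assign every surviving doc a new contiguous ID, the way a
        // merge remaps doc IDs into the new segment's doc-ID space.
        var next = 0
        val remaps: Seq[Map[Int, Int]] = segments.map { seg =>
          val live = seg.postings.values.flatten.toSet
            .diff(seg.deleted).toSeq.sorted
          live.map { old => val id = next; next += 1; old -> id }.toMap
        }
        // Walk the union of terms and splice the live postings together.
        val allTerms = segments.flatMap(_.postings.keys).distinct.sorted
        val merged = for {
          term <- allTerms
          docs = segments.zip(remaps).flatMap { case (seg, remap) =>
            seg.postings.getOrElse(term, Vector.empty)
              .collect { case d if !seg.deleted(d) => remap(d) }
          }.sorted.toVector
          if docs.nonEmpty
        } yield term -> docs
        Segment(merged.toMap, Set.empty)
      }
    }

The point is that merge never mutates its inputs: the old segments keep
serving searches until the new one is swapped in, and the writes are big
and sequential, which is part of why write-once works well on HDDs.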
Whereas it looks like Rhinodog's term dictionary uses on-disk data structures (btree?) that can be updated in place?

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jul 8, 2016 at 10:13 AM, Konstantin <[email protected]> wrote:

> Hello, my name is Konstantin. I'm currently reading Lucene's sources and
> wondering why particular technical decisions were made.
>
> Full disclosure - I'm writing my own inverted index implementation as a
> pet project: https://github.com/kk00ss/Rhinodog . It's about 4 kloc of
> Scala, and there are tests comparing it with Lucene on a wiki dump (I
> actually run them only on a small part of it, ~500MB).
>
> Most interesting to me is why the compaction algorithm is implemented
> this way - it's clear and simple, but wouldn't it be better to merge
> posting lists on a per-term basis? The current Lucene implementation is
> probably better for HDDs, and the proposed one would need an SSD to show
> adequate performance. But that would mean more, smaller compactions,
> each much cheaper. Sometimes, if a term has a small posting list, it
> would be inefficient, but I think some threshold could be used.
>
> This idea comes from the assumption that when half of the documents have
> been removed from a segment, not all the terms might need compaction,
> assuming a non-uniform distribution of terms among documents (which
> seems likely to me, an amateur ;-) ).
>
> Does it make any sense?
>
> BTW, any input about Rhinodog and its benchmarks vs Lucene would be
> appreciated.
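
For what it's worth, the per-term compaction described above might look
roughly like this (a sketch only, assuming an in-place-updatable term
dictionary like Rhinodog's; TermDict, the deleted-fraction check, and
the 0.5 threshold are all invented for illustration):

    import scala.collection.mutable

    object PerTermCompactionSketch {
      // Invented stand-in for an on-disk btree that can be rewritten
      // one term at a time, unlike a write-once segment.
      final class TermDict {
        val postings = mutable.Map.empty[String, Vector[Int]]
        def put(term: String, docs: Vector[Int]): Unit = postings(term) = docs
      }

      // Rewrite only terms whose posting lists are "dirty enough" and
      // big enough to be worth the I/O - the threshold idea from the mail.
      def compact(dict: TermDict, deleted: Set[Int],
                  dirtyThreshold: Double = 0.5, minSize: Int = 128): Int = {
        var compacted = 0
        for ((term, docs) <- dict.postings.toSeq if docs.length >= minSize) {
          val dead = docs.count(deleted)
          if (dead.toDouble / docs.length >= dirtyThreshold) {
            dict.put(term, docs.filterNot(deleted)) // rewrite just this term
            compacted += 1
          }
        }
        compacted
      }
    }

Only the dirty terms get rewritten, so when deletes cluster in a small
subset of terms (the non-uniform distribution assumed above), each
compaction is much smaller and cheaper than a full segment merge - at
the cost of many small random writes, hence the SSD caveat.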
