Thanks!
I'm aware of the current implementation of merging in Lucene at a high
level. Yes, Rhinodog uses a B-tree for storing everything; it is a
bottleneck on writes, but on reads it is almost as fast as direct access
to a location on disk (with a cold cache on an SSD, reads take less time
than decoding the blocks). But maybe there is a way to decouple
merging/storage and codecs from everything else? Just quickly looking over
the sources, it actually seems like a hard task to me, with as-yet-unclear
benefits. I'll compare these compaction strategies.
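
To make the comparison concrete, here is a rough sketch (all names are
invented, not actual Rhinodog or Lucene APIs) of the kind of storage
abstraction I have in mind - merging and encoding would talk to a trait
like this instead of touching files directly, so either a write-once
segment file or a B-tree could sit behind it:

    // Hypothetical interface, only to illustrate the shape of the decoupling.
    trait PostingsStorage {
      // Append an encoded postings block for a term; returns a handle to it.
      def append(term: String, encodedBlock: Array[Byte]): Long
      // Read all encoded blocks currently stored for a term, in order.
      def read(term: String): Iterator[Array[Byte]]
      // Replace a term's blocks with one merged block (per-term compaction).
      def replace(term: String, mergedBlock: Array[Byte]): Unit
    }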

Also, I have a question about search performance - I'm most likely
testing it in the wrong way - do you test performance on real user
queries? What kinds of queries are more likely: those where the query
words have similar frequencies, or those where the words' frequencies
differ by orders of magnitude?
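
To make the question concrete, here is the kind of comparison I have in
mind (a sketch against a Lucene index; the index path, field name, and
term pairs are invented - in practice one would pick pairs by checking
docFreq):

    import java.nio.file.Paths
    import org.apache.lucene.index.{DirectoryReader, Term}
    import org.apache.lucene.search.{BooleanClause, BooleanQuery, IndexSearcher, TermQuery}
    import org.apache.lucene.store.FSDirectory

    object FreqSkewBench extends App {
      // Hypothetical index location and field name.
      val reader = DirectoryReader.open(FSDirectory.open(Paths.get("wiki-index")))
      val searcher = new IndexSearcher(reader)

      // One pair with similar document frequencies, one pair differing
      // by orders of magnitude (example terms only).
      val pairs = Seq(("system", "number"), ("the", "zirconium"))

      for ((a, b) <- pairs) {
        val (ta, tb) = (new Term("body", a), new Term("body", b))
        println(s"docFreq($a)=${reader.docFreq(ta)}, docFreq($b)=${reader.docFreq(tb)}")

        // Conjunction of the two terms, timed end to end.
        val query = new BooleanQuery.Builder()
          .add(new TermQuery(ta), BooleanClause.Occur.MUST)
          .add(new TermQuery(tb), BooleanClause.Occur.MUST)
          .build()

        val start = System.nanoTime()
        val hits = searcher.search(query, 10)
        println(s"$a AND $b -> ${hits.totalHits} hits in ${(System.nanoTime() - start) / 1e6} ms")
      }
      reader.close()
    }
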
On Jul 10, 2016 at 00:22, "Michael McCandless" <
[email protected]> wrote:

> Rhinodog looks neat!  Impressively small sources :)
>
> Merging postings per term would be hard for Lucene, because it's write
> once, i.e. once a segment's terms and postings are written, we cannot
> update them in place.  Instead, we merge N segments together to a new
> larger (also write once) segment.
>
> Whereas it looks like Rhinodog's term dictionary uses on-disk data
> structures (btree?) that can be updated in place?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Jul 8, 2016 at 10:13 AM, Konstantin <[email protected]>
> wrote:
>
>> Hello, my name is Konstantin, I'm currently reading Lucene's sources and
>> wondering why particular technical decisions were made.
>>
>> Full disclosure - I'm writing my own inverted index implementation as a
>> pet project: https://github.com/kk00ss/Rhinodog . It's about 4 kloc of
>> Scala, and there are tests comparing it with Lucene on a wiki dump (I
>> actually run it only on a small part of it, ~500MB).
>>
>> Most interesting to me is why the compaction algorithm is implemented
>> this way - it's clear and simple, but wouldn't it be better to merge
>> postings lists on a per-term basis? The current Lucene implementation is
>> probably better for HDDs, and the proposed one would need an SSD to show
>> adequate performance. But it would mean more, smaller compactions, each
>> much cheaper. Sometimes, if a term has a small postings list, this would
>> be inefficient, but I think some threshold could be used.
>> This idea comes from the assumption that when half of the documents have
>> been removed from a segment, not all of the terms might need compaction,
>> given a non-uniform distribution of terms among documents (which seems
>> likely to me, an amateur ;-) ).
>>
>> Does it make any sense?
>> BTW, any input about Rhinodog and its benchmarks vs Lucene would be
>> appreciated.
>>
>>
>
