On Sat, Jul 9, 2016 at 5:44 PM, Konstantin <[email protected]> wrote:
> Thanks. I'm aware of the current implementation of merging in Lucene on a
> high level. Yes, Rhinodog uses a B-tree for storing everything; it is a
> bottleneck on writes, but it's almost as fast on reads as direct access to
> a location on disk.

Slower writes for faster reads is the right tradeoff for a search engine,
in general, IMO.

> (With a cold cache, while using an SSD, reads take less time than decoding
> blocks.) But maybe there is a way to decouple merging/storing + codecs from
> everything else? Just quickly looking over the sources, it actually seems
> like a hard task to me, with yet unclear benefits. I'll compare these
> compaction strategies.

You mean like Lucene's Codec abstractions?

> Also, I have a question about search performance - I'm most likely testing
> it in a wrong way - do you test performance on real user queries? What
> kinds of queries are more likely: those where the query words have similar
> frequencies, or those where the words' frequencies differ by orders of
> magnitude?

It's not possible to answer this :(  Real user queries, and the real
documents those users were querying, are by far the best, but they are not
easy to come by.

In the nightly Wikipedia benchmark, e.g.
http://home.apache.org/~mikemccand/lucenebench/Phrase.html , I use
synthetically generated queries derived from an index to try to mix up the
relative frequencies of the terms.

Mike McCandless

http://blog.mikemccandless.com
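
[Editor's note: to make the Codec remark above concrete, here is a minimal
sketch of how Lucene's Codec abstraction decouples the postings encoding from
the rest of indexing and merging. It assumes a Lucene 6.x-era API; the codec
name "SwappedPostingsCodec", the index path, and the choice of postings format
are hypothetical illustrations, not anything from this thread or from Rhinodog.]

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Reuses the default codec for stored fields, norms, merging, etc.,
// and overrides only the postings format: the on-disk encoding of the
// inverted index is swapped without touching anything else.
public class SwappedPostingsCodec extends FilterCodec {

  // "Lucene50" is the stock postings format here; in practice you would
  // name your own PostingsFormat, registered via SPI
  // (META-INF/services/org.apache.lucene.codecs.PostingsFormat).
  private final PostingsFormat postings = PostingsFormat.forName("Lucene50");

  public SwappedPostingsCodec() {
    super("SwappedPostingsCodec", Codec.getDefault());
  }

  @Override
  public PostingsFormat postingsFormat() {
    return postings;
  }

  public static void main(String[] args) throws Exception {
    // Usage: plug the codec into IndexWriterConfig. Note that a custom
    // codec name must also be registered via SPI so the index can be
    // opened again later with Codec.forName.
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setCodec(new SwappedPostingsCodec());
    IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/idx")), iwc);
    writer.close();
  }
}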
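
[Editor's note: the last paragraph describes generating synthetic queries from
an index so that term frequencies are mixed. Below is a rough sketch of that
idea, not the actual luceneutil benchmark code. It assumes a Lucene 6.x-era
API; the field name "body", the index path, and the docFreq cutoffs are
hypothetical.]

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class SyntheticQueries {
  public static void main(String[] args) throws Exception {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
    Terms terms = MultiFields.getTerms(reader, "body");

    // Bucket terms by document frequency so we can build queries whose
    // terms have very different relative frequencies.
    List<String> highFreq = new ArrayList<>();
    List<String> lowFreq = new ArrayList<>();
    TermsEnum te = terms.iterator();
    BytesRef term;
    while ((term = te.next()) != null) {
      int df = te.docFreq();
      if (df > reader.maxDoc() / 100) {
        highFreq.add(term.utf8ToString());        // common term
      } else if (df > 100) {
        lowFreq.add(term.utf8ToString());         // rarer, but not noise
      }
    }

    // Pair a common term with a rare one; these pairs can then be issued
    // as phrase or boolean queries in a benchmark.
    Random random = new Random(42);
    for (int i = 0; i < 10; i++) {
      String a = highFreq.get(random.nextInt(highFreq.size()));
      String b = lowFreq.get(random.nextInt(lowFreq.size()));
      System.out.println(a + " " + b);
    }
    reader.close();
  }
}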
