To Michael McCandless:
Yes, I meant Lucene's Codec abstraction, but it alone doesn't cut it. There is probably no easy way to integrate this approach into Lucene, and there is not yet enough evidence that it makes sense at all. I should first test this approach with B-trees and small compactions, using Lucene's existing benchmarks as a reference.

To Erick Erickson:
About the "typical user query": it's clear now that this was speculation.

Thanks everybody

2016-07-10 22:08 GMT+03:00 Erick Erickson <[email protected]>:

> Comments from the peanut gallery...
>
> I'd state it much more harshly. There's no such thing as a "typical
> user query" ;) We spend a lot of time trying to score documents to
> return the "best" answer.... which is totally irrelevant to some of
> the applications we see where the only concern is aggregations. Which
> is totally irrelevant for apps (think, say patents or drug research or
> most other things legal and many things academic) where the overriding
> concern is seeing _all_ the documents pertaining to the search. Which
> is totally irrelevant to e-commerce where the concern is how much
> margin say an aggregator makes which is totally irrelevant to <insert
> case N+1 here>
>
> I truly wish there was a better answer here, but until there is I'd
> just use Mike's stuff if you can, at least that way you're comparing a
> long-running benchmark with the new code.
>
> FWIW,
> Erick
>
> On Sun, Jul 10, 2016 at 6:45 AM, Michael McCandless
> <[email protected]> wrote:
> > On Sat, Jul 9, 2016 at 5:44 PM, Konstantin <[email protected]> wrote:
> >>
> >> Thanks
> >> I'm aware of current implementation of merging in Lucene on a high level.
> >> Yes Rhinodog uses B-tree for storing everything, it is a bottleneck on
> >> writes, but it's almost as fast on reads as direct access to location on
> >> disk.
> >
> > Slower writes for faster reads is the right tradeoff for a search engine, in
> > general, IMO.
> >>
> >> (With cold cache, while using SSD reads take less time than decoding
> >> blocks) But may be there is a way to decouple merging/storing + codes from
> >> everything else? Just quickly looking over the sources it actually seems
> >> like a hard task to me. With yet unclear benefits. I'll compare this
> >> compaction strategies.
> >
> > You mean like Lucene's Codec abstractions?
> >>
> >> Also, I have a question about search performance - I'm most likely
> >> testing it in a wrong way - do you test performance on real users queries?
> >> What kinds of queries are more likely? Those where query word's have similar
> >> frequencies, or those where word's frequencies differ by orders of
> >> magnitude?
> >
> > It's not possible to answer this :(
> >
> > Real user queries and real documents those users were querying is by far
> > best, but they are not easy to come by.
> >
> > In the nightly wikipedia benchmark, e.g.
> > http://home.apache.org/~mikemccand/lucenebench/Phrase.html , I use
> > synthetically generated queries derived from an index to try to mix up the
> > relative frequencies of the terms.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
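
For anyone who wants to try the "synthetic queries with mixed term frequencies" idea on their own data, here is a rough sketch of one way it could be done. This is not the code behind the nightly benchmark. It assumes plain whitespace tokenization, and the function names (doc_freqs, bucket_terms, make_queries) and the bucket thresholds are hypothetical choices made only for illustration. It groups terms into high/medium/low document-frequency buckets and then pairs terms within and across buckets, so that both "similar frequency" and "orders of magnitude apart" queries get exercised.

```python
import random
from collections import Counter


def doc_freqs(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        # set() so a term repeated inside one document counts once
        df.update(set(doc.lower().split()))
    return df


def bucket_terms(df, num_docs, high=0.5, low=0.1):
    """Split terms into high/med/low buckets by the fraction of documents
    containing them. The thresholds are arbitrary assumptions."""
    buckets = {"high": [], "med": [], "low": []}
    for term, count in sorted(df.items()):
        frac = count / num_docs
        if frac >= high:
            buckets["high"].append(term)
        elif frac >= low:
            buckets["med"].append(term)
        else:
            buckets["low"].append(term)
    return buckets


def make_queries(buckets, per_combo, seed=0):
    """Build two-term queries for every bucket combination.

    Returns (bucket_a, bucket_b, query_string) tuples, so results can be
    reported per frequency mix rather than as one blended number."""
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    names = ["high", "med", "low"]
    queries = []
    for i, a in enumerate(names):
        for b in names[i:]:
            for _ in range(per_combo):
                if a == b:
                    if len(buckets[a]) < 2:
                        break  # not enough distinct terms in this bucket
                    t1, t2 = rng.sample(buckets[a], 2)
                else:
                    if not buckets[a] or not buckets[b]:
                        break  # one of the buckets is empty
                    t1 = rng.choice(buckets[a])
                    t2 = rng.choice(buckets[b])
                queries.append((a, b, f"{t1} {t2}"))
    return queries
```

Keeping the bucket labels next to each query makes it possible to time the high/low pairs separately from the high/high pairs, which is the distinction asked about above.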
