[ https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595113#comment-15595113 ]
Yonik Seeley commented on LUCENE-7407:
--------------------------------------

bq. > A quick test by hand is still more informative than having no information at all.
bq. I disagree: it's reckless to run an overly synthetic benchmark and then present the results as if they mean we should make poor API tradeoffs.

When I did a quick test by hand, I always *disclosed* that. It's a starting point, not an ending point.
And even homogeneous tests (that are prone to hotspot overspecialization) are a useful datapoint, if you know what they are. Some users will have exactly those types of requests - very homogeneous.

bq. My point is that running synthetic benchmarks and mis-representing them as "meaningful" is borderline reckless

The implication being that you judge they are not meaningful? Wow.
You seemed to admit that the lucene benchmarks don't even cover some of these cases (or don't cover them adequately).
- There is no single authoritative benchmark, and it's misleading to suggest there is (one that somehow represents the *true* performance for users)
- The lucene benchmarks are also synthetic to a degree (although based off of real data). For example, the query cache is disabled. Why? I assume to better isolate what is being tested.
- More realistic tests are always nice to verify that nothing was messed up... but a system will *always* have a bottleneck. The question is *which bottleneck are you effectively testing*?
- More tests are better. If others have the time/ability, they should run their own!

bq. [...] nowhere near as helpful as, say, improving our default codec, profiling and removing slow spots, removing extra legacy wrappers, etc. Those are more positive ways to move our project forward.

The first step I'd take would be to try to realistically isolate and quantify the performance of what I was trying to change anyway. I did that, starting off with Solr faceting tests (lucene benchmarks don't test that).
I *will* get around to trying to improve things... in the meantime, putting out the information I did have is better than hiding it. Take it for what it is. If you choose to just dismiss it as meaningless... well, I guess we'll have to agree to disagree.

> Explore switching doc values to an iterator API
> -----------------------------------------------
>
>                 Key: LUCENE-7407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7407
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>              Labels: docValues
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
>     you actually use", like postings, which is a compelling
>     reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
>     of doc values, even in the non-sparse case, since the read-time
>     API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
>     implicit in the iteration, and the awkward "return 0 if the
>     document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
>     {{CodecReader}}, and close the trappy "I accidentally shared a
>     single XXXDocValues instance across threads", since an iterator is
>     inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
>     postings over time, since the two problems ("iterate over doc ids
>     and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.
> Or maybe we could have the new iterator APIs also ported to 6.x, side by
> side with the deprecated existing random-access APIs.
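
For readers unfamiliar with the proposal, the forward-only read pattern described in the issue can be sketched roughly as below. This is only an illustration under stated assumptions: the class name DocValuesIteratorSketch, the SparseNumericIterator type, its methods, and the NO_MORE_DOCS sentinel are hypothetical names chosen for this sketch, not the API in the attached patch. It stores values only for documents that have the field, so "which docs have a value" is implicit in the iteration and no "return 0 if missing" default is needed.

```java
/** Hypothetical sketch of a forward-only doc values reader; not Lucene's actual API. */
final class DocValuesIteratorSketch {

    /** Sentinel returned once the iterator is exhausted (assumed convention). */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Iterates (docID, value) pairs only for documents that have the field. */
    static final class SparseNumericIterator {
        private final int[] docs;    // sorted doc IDs that have a value
        private final long[] values; // values[i] belongs to docs[i]
        private int pos = -1;        // index of the current doc, -1 before the first call

        SparseNumericIterator(int[] docs, long[] values) {
            if (docs.length != values.length) {
                throw new IllegalArgumentException("docs and values must have equal length");
            }
            this.docs = docs;
            this.values = values;
        }

        /** Moves to the next doc that has a value, or returns NO_MORE_DOCS. */
        int nextDoc() {
            if (pos < docs.length) {
                pos++;
            }
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Moves forward to the first doc >= target (simple linear scan). */
        int advance(int target) {
            int doc;
            do {
                doc = nextDoc();
            } while (doc < target);
            return doc;
        }

        /** Value of the current doc; only valid while positioned on a doc. */
        long longValue() {
            return values[pos];
        }
    }

    /** Sums all values by walking the iterator once, front to back. */
    static long sum(SparseNumericIterator it) {
        long total = 0;
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            total += it.longValue();
        }
        return total;
    }

    public static void main(String[] args) {
        // Only docs 1, 4 and 9 have the field; nothing is stored for the others.
        SparseNumericIterator it =
            new SparseNumericIterator(new int[] {1, 4, 9}, new long[] {10, 20, 30});
        System.out.println(sum(it)); // prints 60
    }
}
```

Because each iterator carries its own position, a caller creates a fresh one per pass, which is the "use once" property the description mentions. The linear advance here is only for brevity; a real codec would presumably skip more cleverly.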