[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

Yonik Seeley (JIRA) Fri, 21 Oct 2016 06:43:16 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595113#comment-15595113
 ]


Yonik Seeley commented on LUCENE-7407:
--------------------------------------

bq. > A quick test by hand is still more informative than having no information 
at all.
bq.  I disagree: it's reckless to run an overly synthetic benchmark and then 
present the results as if they mean we should make poor API tradeoffs.

When I did a quick test by hand, I always *disclosed* that.  It's a starting 
point, not an ending point.
And even homogeneous tests (which are prone to HotSpot overspecialization) are a 
useful data point, as long as you know what they are.
Some users will have exactly those types of requests - very homogeneous.
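
For readers unfamiliar with the overspecialization concern, here is a minimal 
sketch of the difference between a homogeneous and a mixed workload. All names 
({{Request}}, {{Facet}}, {{Sort}}, {{Filter}}, {{drive}}) are hypothetical and 
are not taken from the Lucene or Solr benchmarks. The JIT may specialize the 
single call site in {{drive}} when it only ever sees one receiver type, so the 
homogeneous number can look better than what a mixed workload would see. The 
timings are illustrative only and nothing here asserts a particular result.

```java
// Hypothetical workload; this is an illustration, not a real benchmark harness.
final class HomogeneousVsMixedSketch {

    /** One "request type"; each implementation is a different receiver class. */
    interface Request {
        long run(int i);
    }

    static final class Facet implements Request {
        @Override public long run(int i) { return i & 7; }
    }

    static final class Sort implements Request {
        @Override public long run(int i) { return i % 3; }
    }

    static final class Filter implements Request {
        @Override public long run(int i) { return i >>> 2; }
    }

    /** Hot loop with a single interface call site, cycling through the given requests. */
    static long drive(Request[] requests, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += requests[i % requests.length].run(i);
        }
        return checksum;
    }

    /** Crude timing: one warm-up pass, then one measured pass. */
    static long timeNanos(Request[] requests, int iterations) {
        drive(requests, iterations);
        long start = System.nanoTime();
        long checksum = drive(requests, iterations);
        long elapsed = System.nanoTime() - start;
        if (checksum < 0) {
            // Never true for these inputs; keeps the result observably used.
            System.out.println(checksum);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int iterations = 50_000_000;
        // Homogeneous: the call site only ever sees Facet.
        Request[] homogeneous = { new Facet() };
        // Mixed: the same call site sees three receiver types.
        Request[] mixed = { new Facet(), new Sort(), new Filter() };

        System.out.println("homogeneous ns: " + timeNanos(homogeneous, iterations));
        System.out.println("mixed ns:       " + timeNanos(mixed, iterations));
    }
}
```

Running both cases in one JVM also means the second measurement inherits the 
type profile of the first, so a more careful comparison would use a fresh JVM 
per case, or a harness such as JMH.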

bq. My point is that running synthetic benchmarks and mis-representing them as 
"meaningful" is borderline reckless

The implication being that you judge they are not meaningful? Wow.

You seemed to admit that the lucene benchmarks don't even cover some of these 
cases (or don't cover them adequately).
- There is no single authoritative benchmark, and it's misleading to suggest 
there is (that somehow represents the *true* performance for users)
- The lucene benchmarks are also synthetic to a degree (although based off of 
real data).  For example, the query cache is disabled.  Why?  I assume to 
better isolate what is being tested.
- More realistic tests are always nice to verify that nothing was messed up... 
but a system will *always* have a bottleneck.  The question is *which 
bottleneck are you effectively testing*?
- More tests are better.  If others have the time/ability, they should run 
their own!

bq. [...] nowhere near as helpful as, say, improving our default codec, 
profiling and removing slow spots, removing extra legacy wrappers, etc. Those 
are more positive ways to move our project forward.

The first step I'd take would be to try and realistically isolate and quantify 
the performance of what I was trying to change anyway.  I did that starting off 
with Solr faceting tests (lucene benchmarks don't test that).

I *will* get around to trying to improve things. In the meantime, putting out 
the information I did have is better than hiding it.  Take it for what it is.
If you choose to just dismiss it as meaningless... well, I guess we'll have to 
agree to disagree.


> Explore switching doc values to an iterator API
> -----------------------------------------------
>
>                 Key: LUCENE-7407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7407
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>              Labels: docValues
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
>     you actually use" model, like postings, which is a compelling
>     reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
>     of doc values, even in the non-sparse case, since the read-time
>     API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
>     implicit in the iteration, and the awkward "return 0 if the
>     document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
>     {{CodecReader}}, and close the trappy "I accidentally shared a
>     single XXXDocValues instance across threads", since an iterator is
>     inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
>     postings over time, since the two problems ("iterate over doc ids
>     and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.
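
To make the proposal in the issue description concrete, here is a minimal 
sketch of what a forward-only doc values iterator could look like. It is 
self-contained and does not use Lucene classes. The names 
({{NumericIterator}}, {{nextDoc}}, {{longValue}}, {{NO_MORE_DOCS}}, {{sparse}}) 
are assumptions chosen for illustration and are not necessarily the API in the 
attached patch. The point it tries to show is that only documents that have a 
value are stored and visited, so a separate {{getDocsWithField}} lookup and the 
"return 0 if missing" convention are not needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not Lucene's actual API.
final class DocValuesIteratorSketch {

    /** Sentinel returned once the iterator is exhausted. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Forward-only view over the documents that have a value. */
    interface NumericIterator {
        /** Advances to the next doc that has a value, or returns NO_MORE_DOCS. */
        int nextDoc();

        /** Value of the current doc; only valid after nextDoc() returned a real doc. */
        long longValue();
    }

    /** Sparse storage: parallel arrays of ascending doc ids and their values. */
    static NumericIterator sparse(int[] docs, long[] values) {
        return new NumericIterator() {
            private int idx = -1;

            @Override
            public int nextDoc() {
                if (idx < docs.length) {
                    idx++;
                }
                return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
            }

            @Override
            public long longValue() {
                return values[idx];
            }
        };
    }

    /** Sums all values; documents without a value are never visited. */
    static long sum(NumericIterator it) {
        long total = 0;
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            total += it.longValue();
        }
        return total;
    }

    /** The set of docs with a value falls out of the iteration itself. */
    static List<Integer> docsWithValue(NumericIterator it) {
        List<Integer> result = new ArrayList<>();
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            result.add(doc);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] docs = {2, 5, 9};
        long[] values = {10L, 20L, 30L};
        System.out.println(sum(sparse(docs, values)));           // prints 60
        System.out.println(docsWithValue(sparse(docs, values))); // prints [2, 5, 9]
    }
}
```

Each call to {{sparse}} returns a fresh iterator with its own position, which 
is the "use once" property the description mentions: there is no shared 
instance to accidentally use from two threads.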



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
