Re: [HACKERS] gaussian distribution pgbench

Fabien COELHO Thu, 17 Jul 2014 13:15:13 -0700

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what an exponential or Gaussian distribution is, so the idea
would still be to have a relatively compact explanation, and that the
interested but clueless reader would read up on it on Wikipedia, in a
text book, or from a friend or a math teacher (who could be a friend as
well :-).


Well, I think it's a balance.  I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench.  Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple.  And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

 [nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%


I don't have a clue what that means.  None.


Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"


I still wouldn't know what that meant.  And it misses the point
anyway: if the documentation is good, this will be unnecessary.  If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.


The decile description is quite classic when discussing statistics.

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:


This is more or less the approximate behavior of the patch, but for 1% of
the range, not 50%. However I'm not sure that the current documentation is
so bad.


I think it isn't, because in the system I described, a larger value
indicates a flatter distribution, but in the documentation, a smaller
value indicates a flatter distribution.


Ok. But the general thrust was ok.

That having been said, I agree the current documentation for the
exponential distribution is not too bad. But this part does not make
sense:
+      A crude approximation of the distribution is that the most frequent 1%
+      values are drawn <replaceable>threshold</>% of the time.

I'm trying to be nice to the reader by providing some intuition. I do not
seem to succeed :-) I'm attempting to say that when you draw from a range,
say 1 to 1000, the first 1%, i.e. values 1 to 10, are drawn about
"threshold"% of the time.

If I draw from one hundred values:

    \setrandom x 1 100 exponential 10.0

The 1 will be drawn about 10% of the time, and the next 99 values will
share the remaining 90%.
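
The figures quoted in this thread can be checked with a short sketch. It
assumes (this is not stated in the message) that the draw follows an
exponential density proportional to exp(-threshold * x), truncated to
[0, 1) and then scaled onto the integer range. The helper name
exponential_probs is made up for the illustration and is not part of
pgbench.

```python
import math

def exponential_probs(lo, hi, threshold):
    """Probability of each integer in [lo, hi], assuming a density
    proportional to exp(-threshold * x) truncated to [0, 1)."""
    n = hi - lo + 1
    norm = 1.0 - math.exp(-threshold)  # mass of the untruncated part kept
    return [
        (math.exp(-threshold * k / n) - math.exp(-threshold * (k + 1) / n)) / norm
        for k in range(n)
    ]

# Same parameters as: \setrandom x 1 100 exponential 10.0
probs = exponential_probs(1, 100, 10.0)
deciles = [sum(probs[d * 10:(d + 1) * 10]) for d in range(10)]

print("P(x = 1): %.1f%%" % (100 * probs[0]))
print("decile percents:", " ".join("%.1f%%" % (100 * d) for d in deciles))

# A threshold close to 0 gives an almost uniform distribution.
flat = exponential_probs(1, 100, 0.001)
print("threshold 0.001, P(x = 1): %.2f%%" % (100 * flat[0]))
```

Under that assumption, the first value gets 1 - exp(-threshold/100) of the
draws (before normalization), which is close to threshold% only while the
threshold is small; this is why "about 10%" comes out slightly below 10%
for a threshold of 10.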

+      The closer to 0.0 the threshold, the flatter (more uniform) the access
+      distribution.

Given the first statement, I'd expect the lowest possible threshold to
be 0.01, not 0.

This is in the sense of "epsilon", a small number close to 0 but different
from 0. The lowest possible threshold is the smallest strictly positive
value representable with a "double".

The documentation for the Gaussian distribution is in somewhat worse
shape.  Unlike the documentation for exponential, it makes no attempt
at all to give the user a clear idea what the distribution actually
looks like.  The closest it comes is this:

+      In other worlds, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.

But that's not really very close - there's no way for a user to judge
what impact the threshold parameter actually has except to try it.
Unlike the discussion of exponential, which contains a fairly-precise
mathematical characterization of the behavior,

I have now added a precise formula for Gaussian. When you see the formula,
maybe you would still want to see the deciles to get an intuition.

I think that we assumed that the reader would know that a Gaussian
distribution is the classic bell-shaped distribution, and if not, they
would not be interested anyway.
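
To give a feel for "the larger the threshold, the narrower the access range
around the middle", here is a rough sketch. It assumes (the formula itself
is not quoted in this message) that the threshold is the number of standard
deviations at which a standard normal is cut off on each side before being
scaled onto the integer range. The names phi, gaussian_probs and
middle_share are hypothetical helpers, not pgbench functions.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_probs(lo, hi, threshold):
    """Probability of each integer in [lo, hi], assuming a standard normal
    truncated to [-threshold, threshold] and scaled onto the range."""
    n = hi - lo + 1
    norm = 2.0 * phi(threshold) - 1.0  # mass kept after truncation

    def mass_below(k):
        # Normal CDF at the boundary between bucket k-1 and bucket k.
        return phi(threshold * (2.0 * k / n - 1.0))

    return [(mass_below(k + 1) - mass_below(k)) / norm for k in range(n)]

def middle_share(threshold):
    """Share of draws landing on values 41..60 out of 1..100."""
    probs = gaussian_probs(1, 100, threshold)
    return sum(probs[40:60])

for t in (2.0, 5.0, 10.0):
    print("threshold %4.1f: middle 20%% of the range gets %.1f%% of the draws"
          % (t, 100 * middle_share(t)))
```

Under that assumption the share of draws landing in the middle fifth of
the range grows with the threshold, which is the narrowing effect the
quoted sentence describes.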

the Gaussian stuff has
nothing except a hand-wavy explanation that a higher threshold skews
the distribution more.  (Also, the English expression is "in other
words" not "in other worlds" - but in fact the phrase has no business
in that sentence at all, because it is not reiterating the contents of
the previous sentence in different language, but rather making a new
point entirely.  And the following sentence does not start with a
capital letter, though maybe that's because it was intended to be
incorporated into this sentence somehow.)

I think that you also need to consider which instances of the words
"gaussian" and "exponential" are referring to the option and which are
referring to the abstract mathematical concept.  When you're talking
about the option, you should use all lower-case (as you've done) but
with <literal> tags or similar.  When you're referring to the
mathematical distribution, Gaussian should be capitalized.

BTW, I agree with both Heikki's suggestion that we make these options
to setrandom only and not expose command-line options for them, and
with Andres's critique that the documentation of those options is far
too repetitive.

I'll have yet another go at trying to improve the documentation, esp. the
Gaussian part. However you must allow that this is mathematics, and the
user who wants to use these distributions will be expected to understand
what they are somehow beforehand.


Moreover, I cannot make it precise, intuitive and very short.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
