On 30/08/14 19:24, Tom Lane wrote:
Pavel Stehule <pavel.steh...@gmail.com> writes:
1. I think the reduction to only numeric types is not necessary - although
we can live without it - since there are a lot of non-numeric categories:
chars, dates, ...

I wasn't terribly happy about that either.  I still think we should
reduce this to a single polymorphic function, as in the attached.


I did try to write a generic function very similar to what you wrote, but I discarded it for performance reasons. If we really want to support any data type, I am all for having a generic function, but I would still like to keep one optimized for float8, since that seems to be the most common use case (it is the reason I wrote this in the first place).

Here are some numbers:
First, float8:
CREATE TABLE wbtest AS SELECT (random() * 100)::float8 a FROM generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;

Optimized float8: ~250ms
Tom's generic: ~400ms


Numeric:
CREATE TABLE wbtestn AS SELECT (random() * 100)::numeric a FROM generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;

Optimized numeric: ~600ms
My generic: ~780ms
Tom's generic: ~900ms

The difference between my generic version and Tom's is that Tom's is slowed down by the deconstruct_array call.


2. I am still strongly worried about the search method used - it is not
possible to check the validity of the input. Did you check how much slower
linear searching is? You said you have real data and a real use case.

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N).  That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()).  I think it's probably
sufficient to have a clear warning in the docs.
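
To make the cost argument concrete, here is a minimal plpgsql sketch of the binary search (illustration only - the actual implementation is in C, and wb_sketch is a made-up name). Note that with unsorted thresholds it silently returns a wrong bucket rather than raising an error, which is exactly the hazard the docs would need to warn about:

CREATE OR REPLACE FUNCTION wb_sketch(operand float8, thresholds float8[])
RETURNS int LANGUAGE plpgsql AS $$
DECLARE
    lo int := 1;
    hi int := array_length(thresholds, 1) + 1;
    mid int;
BEGIN
    -- Invariant: every threshold below position lo is <= operand,
    -- every threshold at position hi or above is > operand.
    WHILE lo < hi LOOP
        mid := (lo + hi) / 2;       -- integer division truncates
        IF operand >= thresholds[mid] THEN
            lo := mid + 1;
        ELSE
            hi := mid;
        END IF;
    END LOOP;
    RETURN lo - 1;                  -- 0 means below the first threshold
END;
$$;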


With a re-sort (to validate the input) the performance is worse than a bunch of CASE WHEN expressions :(
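
For reference, this is the kind of hand-written expression I am comparing against (shortened to three thresholds here; the benchmarks above use nine):

SELECT CASE
         WHEN a < 10 THEN 0
         WHEN a < 20 THEN 1
         WHEN a < 30 THEN 2
         ELSE 3   -- everything above the last threshold
       END AS bucket
FROM wbtest;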


Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as "right bounds", which I didn't care
for and replaced with "upper bounds".  However, looking at it again,
I wonder if we would not be better off describing them as "lower bounds".
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing.  Thoughts?

Upper bounds is probably a better name than right bounds, I agree with that. In any case, each threshold is the upper bound of a bucket (that's why there is one extra bucket, into which everything bigger than the max threshold goes).
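
To illustrate that viewpoint with the patched function (bucket numbers as I understand the intended semantics):

SELECT width_bucket(5::float8,  ARRAY[10, 20, 30]::float8[]);  -- 0: below every threshold
SELECT width_bucket(10::float8, ARRAY[10, 20, 30]::float8[]);  -- 1: a threshold is an inclusive lower bound
SELECT width_bucket(99::float8, ARRAY[10, 20, 30]::float8[]);  -- 3: the extra top bucket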


--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

