On 15/08/14 09:47, Tom Lane wrote:
> Peter Geoghegan <p...@heroku.com> writes:
>> On Thu, Aug 14, 2014 at 10:57 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Maybe this is telling us it's not worth changing the representation,
>>> and we should just go do something about the first_success_by threshold
>>> and be done.  I'm hesitant to draw such conclusions on the basis of a
>>> single use-case though, especially one that doesn't really have that
>>> much use for compression in the first place.  Do we have other JSON
>>> corpuses to look at?
>> Yes. Pavel posted some representative JSON data a while back:
>> http://pgsql.cz/data/data.dump.gz (it's a plain dump)
> I did some quick stats on that.  206560 rows
>
>                                          min     max     avg
>
> external text representation             220     172685  880.3
> JSON representation (compressed text)    224     78565   541.3
> pg_column_size, JSONB HEAD repr.         225     82540   639.0
> pg_column_size, all-lengths repr.        225     66794   531.1
>
> So in this data, there definitely is some scope for compression:
> just compressing the text gets about 38% savings.  The all-lengths
> hack is able to beat that slightly, but the all-offsets format is
> well behind at 27%.
>
> Not sure what to conclude.  It looks from both these examples like
> we're talking about a 10 to 20 percent size penalty for JSON objects
> that are big enough to need compression.  Is that beyond our threshold
> of pain?  I'm not sure, but there is definitely room to argue that the
> extra I/O costs will swamp any savings we get from faster access to
> individual fields or array elements.
>
>                         regards, tom lane
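
For what it's worth, figures along those lines could presumably be gathered straight from the restored dump with a query like the sketch below. This is only a sketch: the table and column names ("js" and "data") are placeholders for whatever Pavel's dump actually restores into.

    -- Hypothetical names: assume the dump restored into a table js(data jsonb).
    -- octet_length(data::text) approximates the external text size, while
    -- pg_column_size(data) reports the stored (possibly TOAST-compressed) size.
    SELECT count(*)                                AS n_rows,
           min(octet_length(data::text))           AS text_min,
           max(octet_length(data::text))           AS text_max,
           round(avg(octet_length(data::text)), 1) AS text_avg,
           min(pg_column_size(data))               AS stored_min,
           max(pg_column_size(data))               AS stored_max,
           round(avg(pg_column_size(data)), 1)     AS stored_avg
    FROM js;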


Curious: would adding the standard deviation help in characterising the distribution of the data values?

You might also consider adding the median, and possibly the 25th and 75th percentiles (or some such). I assume the 'avg' in your table refers to the arithmetic mean. The median is sometimes a better measure of 'typical' than the arithmetic mean, and it can be useful to note the difference between the two!
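
Roughly, and again assuming the same hypothetical js(data jsonb) table as in the earlier sketch, those extra statistics (standard deviation, median and quartiles) could be pulled out with stddev_samp and the ordered-set aggregate percentile_cont:

    -- Hypothetical js(data jsonb) table; sizes as seen by pg_column_size.
    SELECT round(stddev_samp(pg_column_size(data)), 1)                        AS stddev,
           percentile_cont(0.25) WITHIN GROUP (ORDER BY pg_column_size(data)) AS pct_25,
           percentile_cont(0.5)  WITHIN GROUP (ORDER BY pg_column_size(data)) AS median,
           percentile_cont(0.75) WITHIN GROUP (ORDER BY pg_column_size(data)) AS pct_75
    FROM js;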

Graphing the values may also be useful; even a crude histogram like the one sketched below would do. If there are two or more distinct populations, they should show up as separate peaks, which in turn might suggest changes to the algorithm.
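
A rough text-mode histogram is easy enough to get in SQL (same hypothetical js(data jsonb) table as above; the bucket range and bar scale are arbitrary and would need tuning to the data):

    -- 20 equal-width buckets up to 100 kB; rows above that land in bucket 21.
    -- One '*' per 500 rows in the bar column, purely for eyeballing peaks.
    SELECT width_bucket(pg_column_size(data), 0, 100000, 20) AS bucket,
           count(*)                                          AS n_rows,
           repeat('*', (count(*) / 500)::int)                AS bar
    FROM js
    GROUP BY bucket
    ORDER BY bucket;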

Many moons ago I did a 400-level statistics course at university, most of which I've forgotten. I'm aware of other potentially useful measures, but I suspect they would be too esoteric for the current problem!


Cheers,
Gavin


