On Fri, Aug 1, 2014 at 2:20 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:

> On Thursday, July 31, 2014, Larry White <ljw1...@gmail.com> wrote:
>
>>  Hi,
>>
>> I'm running an experiment on 9.4 beta 2.
>>
>> I put 275,000 identical JSON files into a table using JSONB (one per
>> row).  Each raw text file is 251K in size, so the total uncompressed is
>> 69GB. The column storage is set to EXTENDED.  There are other toastable
>> columns in the table, but none have more than 36 bytes of data in them.
>>
>> My Toast table is 66GB. I would have expected to get that much (or more)
>> compression just from JSONB being a binary format.
>>
>> If I compress one of these JSON files outside of Postgres, it goes from
>> 251K to 1K.
>>
>
> That is an astonishing amount of compression.  Are you really compressing
> one of the files in isolation, rather than co-compressing many and then
> pro-rating the result?
>

Yes, I should have explained. These are generated JSON files for testing,
and there is massive repetition in them, which is why they compress so well
outside of Postgres (basically there is a repeating array of the same
string). I did compress just the one file.
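
For reference, the load looked roughly like this (table and column names
here are placeholders, not the actual schema), and the TOAST figure above
came from a query along these lines:

    -- Placeholder schema approximating the setup described above.
    CREATE TABLE json_docs (
        id    serial PRIMARY KEY,
        label text,   -- stand-in for the other toastable columns (< 36 bytes each)
        doc   jsonb
    );
    ALTER TABLE json_docs ALTER COLUMN doc SET STORAGE EXTENDED;  -- already the default for jsonb

    -- After loading the 275,000 documents, compare the heap with its TOAST table:
    SELECT pg_size_pretty(pg_relation_size('json_docs'))                   AS heap_size,
           pg_size_pretty(pg_total_relation_size(reltoastrelid::regclass)) AS toast_size
    FROM pg_class
    WHERE relname = 'json_docs';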


>
> Can you provide an example of the data, and the command line you used to
> compress it?
>

I compressed it on a Mac using the Compress option in the UI. Here's a brief
sample from the file; you can see why it compresses so well:
"{\"junk\":[\"124245etweetwet345gwtretwt43 qwrqwq qwre qw
rsdflkas\",\"q4535 wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535
wqrqwrqwrqw2 wrqwrqwrq32232w kswe sfasrs
sdfsd\",\"124245etweetwet345gwtretwt43 qwrqwq qwre qw rsdflkas\",\"q4535

and so on.

So each file, once compressed, should theoretically fit on a single row in the toast table.
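
If that is happening, each document should occupy only a chunk or two in the
TOAST table. One way to check (still using the placeholder table name from
the sketch above; the real TOAST table name comes from reltoastrelid) is to
count the chunks per value:

    -- Find the TOAST table backing the (placeholder) table:
    SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'json_docs';

    -- Count how many ~2KB chunks each stored value occupies; substitute the
    -- name returned above for pg_toast.pg_toast_NNNNN:
    SELECT chunk_id,
           count(*)                      AS chunks,
           sum(octet_length(chunk_data)) AS stored_bytes
    FROM pg_toast.pg_toast_NNNNN
    GROUP BY chunk_id
    ORDER BY chunks DESC
    LIMIT 5;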
>> In total, the amount is well under a GB when compressed outside of PG.
>>
>> Any guesses as to why there is so little compression of this data or how
>> I might remedy the situation?
>>
>
> PostgreSQL's built-in tuple compression is generally not very good.  It is
> good at compressing long strings of identical bytes, but not good at
> compressing the kind of thing you are likely to find in JSON (unless your
> JSON has long strings of spaces to reflect indentation of deeply nested
> structures, which JSON probably wouldn't and which JSONB certainly
> wouldn't).  It was designed to be very fast and to be unencumbered by the
> patent issues common at the time it was written.  It was not designed to
> give the best possible compression ratios.
>
> It also compresses each row independently.  Most of the compression
> opportunities in a column of JSON data would probably be between rows, when
> the same keys show up over and over again, not within a row.  But it
> can't capture those opportunities.
>

I'm not expecting miracles with real data, but as far as I can tell, there
is zero compression happening. I'm wondering if it is disabled for JSONB
for some reason.
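
One way I can think of to check (again with the placeholder names from the
sketch above) is to compare the stored size of a value with its approximate
uncompressed size, and to confirm the column's storage setting:

    -- 'x' = EXTENDED (out-of-line, compressible); 'e' = EXTERNAL (out-of-line, never compressed)
    SELECT attname, attstorage
    FROM pg_attribute
    WHERE attrelid = 'json_docs'::regclass AND attname = 'doc';

    -- pg_column_size() reports the on-disk (possibly compressed) size of the datum;
    -- the text cast is only a rough stand-in for the uncompressed size.
    SELECT pg_column_size(doc)     AS stored_bytes,
           octet_length(doc::text) AS approx_uncompressed_bytes
    FROM json_docs
    LIMIT 1;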


>
> Cheers,
>
> Jeff
>
