On Sat, Aug 9, 2014 at 3:51 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgri...@ymail.com> writes:
>>> Stephen Frost <sfr...@snowman.net> writes:
>>>> Trying to move the header to the end just for the sake of this
>>>> doesn't strike me as a good solution as it'll make things quite
>>>> a bit more complicated.
>
>> Why is that?  How much harder would it be to add a single offset
>> field to the front to point to the part we're shifting to the end?
>> It is not all that unusual to put a directory at the end, like in
>> the .zip file format.
>
> Yeah, I was wondering that too.  Arguably, directory-at-the-end would
> be easier to work with for on-the-fly creation, not that we do any
> such thing at the moment.  I think the main thing that's bugging Stephen
> is that doing that just to make pglz_compress happy seems like a kluge
> (and I have to agree).
>
> Here's a possibly more concrete thing to think about: we may very well
> someday want to support JSONB object field or array element extraction
> without reading all blocks of a large toasted JSONB value, if the value is
> stored external without compression.  We already went to the trouble of
> creating analogous logic for substring extraction from a long uncompressed
> text or bytea value, so I think this is a plausible future desire.  With
> the current format you could imagine grabbing the first TOAST chunk, and
> then if you see the header is longer than that you can grab the remainder
> of the header without any wasted I/O, and for the array-subscripting case
> you'd now have enough info to fetch the element value from the body of
> the JSONB without any wasted I/O.  With directory-at-the-end you'd
> have to read the first chunk just to get the directory pointer, and this
> would most likely not give you any of the directory proper; but at least
> you'd know exactly how big the directory is before you go to read it in.
> The former case is probably slightly better.  However, if you're doing an
> object key lookup, not an array element fetch, neither of these formats is
> really friendly at all, because each binary-search probe probably requires
> bringing in one or two toast chunks from the body of the JSONB value so
> you can look at the key text.  I'm not sure if there's a way to redesign
> the format to make that less painful/expensive --- but certainly, having
> the key texts scattered through the JSONB value doesn't seem like a great
> thing from this standpoint.

I think that's a good point.
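
To make the access pattern concrete, here's a rough sketch of what
header-first element extraction might look like.  The slice-fetch entry
point heap_tuple_untoast_attr_slice() is the machinery we already use
for text/bytea substrings; every other helper name here is invented:

/*
 * Sketch only: fetch one array element from a toasted, uncompressed
 * JSONB value without detoasting the whole thing.  The JEntry array at
 * the front tells us the element's byte range in the body, so we can
 * then fetch exactly the chunks that cover it.
 */
static Datum
jsonb_fetch_element_sliced(struct varlena *attr, int index)
{
    JsonbContainer *hdr;
    uint32      nelems;
    Size        hdrsize;

    /* The first chunk always contains the 4-byte container header. */
    hdr = (JsonbContainer *)
        VARDATA(heap_tuple_untoast_attr_slice(attr, 0, TOAST_MAX_CHUNK_SIZE));
    nelems = hdr->header & JB_CMASK;
    hdrsize = offsetof(JsonbContainer, children) + nelems * sizeof(JEntry);

    if (hdrsize > TOAST_MAX_CHUNK_SIZE)
    {
        /* Header spills past chunk 0: fetch the rest, no wasted I/O. */
        hdr = fetch_rest_of_header(attr, hdrsize);      /* hypothetical */
    }

    /*
     * With all the JEntries in hand, compute the element's offset and
     * length in the body and slice out just those bytes.
     */
    return slice_element_from_body(attr, hdr, index);   /* hypothetical */
}

With directory-at-the-end, the same function would need one extra slice
fetch just to locate the directory before it could do anything else.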

On the general topic, I don't think it's reasonable to imagine that
we're going to come up with a single heuristic that works well for
every kind of input data.  What pglz is doing - assuming that if the
beginning of the data is incompressible then the rest probably is too
- is fundamentally reasonable, notwithstanding the fact that it
doesn't happen to work out well for JSONB.  We might be able to tinker
with that general strategy in some way that seems to fix this case and
doesn't appear to break others, but there's some risk in that, and
there's no obvious reason in my mind why pglz should be required to fly
blind.  So I think it would be a better idea to arrange some method by
which JSONB (and perhaps other data types) can provide compression
hints to pglz.
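
As a straw man, the hint could be as simple as letting a datatype
override the PGLZ_Strategy that toast_compress_datum() currently
hardwires to PGLZ_strategy_default.  The strategy struct and its default
values are real; the per-type lookup below is invented:

/*
 * The knob that bites JSONB is first_success_by: with the default
 * strategy, pglz gives up if it finds no match within the first 1024
 * bytes, which for a large JSONB value is pure JEntry header.  A
 * type-supplied strategy could simply disable that early abort:
 */
static const PGLZ_Strategy jsonb_compress_strategy = {
    32,         /* min_input_size: don't bother below 32 bytes */
    INT_MAX,    /* max_input_size: no upper limit */
    25,         /* min_comp_rate: still require 25% savings overall */
    INT_MAX,    /* first_success_by: never abort on an incompressible
                 * prefix; the header is noise but the body compresses */
    128,        /* match_size_good: same as the default strategy */
    10          /* match_size_drop: same as the default strategy */
};

/* hypothetical plumbing in toast_compress_datum(): */
strategy = lookup_type_compression_hint(typeid);    /* invented */
if (strategy == NULL)
    strategy = PGLZ_strategy_default;

Every other type would keep today's behavior; only types that ask for it
would pay the extra compression effort on incompressible prefixes.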

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

