I've subscribed to review this patch in the current CF, so let me start with a brief summary of the thread as it started more than a year ago.

In general the thread is about hitting the 1GB allocation size limit (and 32-bit length in FE/BE protocol) in various ways:

1) attributes are smaller than 1GB, but row exceeds the limit

This causes issues in COPY/pg_dump, as we allocate a single StringInfo for the whole row, and send it as a single message (which includes int32 length, just like most FE/BE messages).

2) attribute values that get over the 1GB due to encoding

For example it's possible to construct values that are OK in hex encoding but >1GB in escape (and vice versa). Those values get stored just fine, but there's no way to dump / COPY them. And if you happen to have both, you can't do pg_dump :-/

I think it's OK not to be able to store extremely large values, but only if we make it predictable (ideally by rejecting the value before storing in in the database). The cases discussed in the thread are particularly annoying exactly because we do the opposite - we allow the value to be stored, but then fail when retrieving the value or trying to backup the database (which is particularly nasty, IMNSHO).

So although we don't have that many reports about this, it'd be nice to improve the behavior a bit.

The trouble is, this rabbit hole is fairly deep - wherever we palloc or detoast a value, we're likely to hit those issues with wide values/rows.

Honestly, I don't think it's feasible to make all the code paths work with such wide values/rows - as TL points out, particularly with the wide values it would be enormously invasive.

So the patch aims to fix just the simplest case, i.e. when the values are smaller than 1GB but the total row is larger. It does so by allowing StringInfo to exceed the MaxAllocSize in special cases (currently only COPY FROM/TO), and using MCXT_ALLOC_HUGE when forming the heap tuple (to make it possible to load the data).

It seems to work and I do think it's a reasonable first step to make things work.

A few minor comments regarding the patch:

1) CopyStartSend seems pretty pointless - It only has one function call in it, and is called on exactly one place (and all other places simply call allowLongStringInfo directly). I'd get rid of this function and replace the call in CopyOneRowTo(() with allowLongStringInfo().

2) allowlong seems awkward, allowLong or allow_long would be better

3) Why does resetStringInfo reset the allowLong flag? Are there any cases when we want/need to forget the flag value? I don't think so, so let's simply not reset it and get rid of the allowLongStringInfo() calls. Maybe it'd be better to invent a new makeLongStringInfo() method instead, which would make it clear that the flag value is permanent.

4) HandleParallelMessage needs a tweak, as it uses msg->len in a log message, but with '%d' and not '%ld' (as needed after changing the type to Size).

I however wonder whether it wouldn't be better to try Robert's proposal, i.e. not building the huge StringInfo for the whole row, but sending the attribute data directly. I don't mean to send messages for each attribute - the current FE/BE protocol does not allow that (it says that each 'd' message is a whole row), but just sending the network message in smaller chunks.

That would make the StringInfo changes unnecessary, reduce the memory consumption considerably (right now we need 2x the memory, and we know we're dealing with gigabytes of data).

Of course, it'd require significant changes of how copy works internally (instead of accumulating data for the whole row, we'd have to send them right in smaller chunks). Which would allow us getting rid of the StringInfo changes, but we'd not be able to backpatch this.

Regarding interpreting the Int32 length field in FE/BE protocol as unsigned, I'm a bit worried it might qualify as breaking the protocol. It's true we don't really say whether it's signed or unsigned, and we handle it differently depending on the message type, but I wonder how many libraries simply use int32. OTOH those clients are unlikely to handle even the 2GB we might send them without breaking the protocol, so I guess this is fine. And 4GB seems like the best we can do.


Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to