I've subscribed to review this patch in the current CF, so let me start
with a brief summary of the thread as it started more than a year ago.
In general the thread is about hitting the 1GB allocation size limit
(and 32-bit length in FE/BE protocol) in various ways:
1) attributes are smaller than 1GB, but row exceeds the limit
This causes issues in COPY/pg_dump, as we allocate a single StringInfo
for the whole row, and send it as a single message (which includes int32
length, just like most FE/BE messages).
2) attribute values that get over the 1GB due to encoding
For example it's possible to construct values that are OK in hex
encoding but >1GB in escape (and vice versa). Those values get stored
just fine, but there's no way to dump / COPY them. And if you happen to
have both, you can't do pg_dump :-/
I think it's OK not to be able to store extremely large values, but only
if we make it predictable (ideally by rejecting the value before storing
in in the database). The cases discussed in the thread are particularly
annoying exactly because we do the opposite - we allow the value to be
stored, but then fail when retrieving the value or trying to backup the
database (which is particularly nasty, IMNSHO).
So although we don't have that many reports about this, it'd be nice to
improve the behavior a bit.
The trouble is, this rabbit hole is fairly deep - wherever we palloc or
detoast a value, we're likely to hit those issues with wide values/rows.
Honestly, I don't think it's feasible to make all the code paths work
with such wide values/rows - as TL points out, particularly with the
wide values it would be enormously invasive.
So the patch aims to fix just the simplest case, i.e. when the values
are smaller than 1GB but the total row is larger. It does so by allowing
StringInfo to exceed the MaxAllocSize in special cases (currently only
COPY FROM/TO), and using MCXT_ALLOC_HUGE when forming the heap tuple (to
make it possible to load the data).
It seems to work and I do think it's a reasonable first step to make
A few minor comments regarding the patch:
1) CopyStartSend seems pretty pointless - It only has one function call
in it, and is called on exactly one place (and all other places simply
call allowLongStringInfo directly). I'd get rid of this function and
replace the call in CopyOneRowTo(() with allowLongStringInfo().
2) allowlong seems awkward, allowLong or allow_long would be better
3) Why does resetStringInfo reset the allowLong flag? Are there any
cases when we want/need to forget the flag value? I don't think so, so
let's simply not reset it and get rid of the allowLongStringInfo()
calls. Maybe it'd be better to invent a new makeLongStringInfo() method
instead, which would make it clear that the flag value is permanent.
4) HandleParallelMessage needs a tweak, as it uses msg->len in a log
message, but with '%d' and not '%ld' (as needed after changing the type
I however wonder whether it wouldn't be better to try Robert's proposal,
i.e. not building the huge StringInfo for the whole row, but sending the
attribute data directly. I don't mean to send messages for each
attribute - the current FE/BE protocol does not allow that (it says that
each 'd' message is a whole row), but just sending the network message
in smaller chunks.
That would make the StringInfo changes unnecessary, reduce the memory
consumption considerably (right now we need 2x the memory, and we know
we're dealing with gigabytes of data).
Of course, it'd require significant changes of how copy works internally
(instead of accumulating data for the whole row, we'd have to send them
right in smaller chunks). Which would allow us getting rid of the
StringInfo changes, but we'd not be able to backpatch this.
Regarding interpreting the Int32 length field in FE/BE protocol as
unsigned, I'm a bit worried it might qualify as breaking the protocol.
It's true we don't really say whether it's signed or unsigned, and we
handle it differently depending on the message type, but I wonder how
many libraries simply use int32. OTOH those clients are unlikely to
handle even the 2GB we might send them without breaking the protocol, so
I guess this is fine. And 4GB seems like the best we can do.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription: