Hi,
Thank you for the review.

On 28.10.2020 22:27, Andres Freund wrote:
I don't see a corresponding configure.ac change?
Shame on me - I completely forgot that configure is actually generated from configure.ac.
Fixed.

+     <varlistentry id="libpq-connect-compression" xreflabel="compression">
+      <term><literal>compression</literal></term>
+      <listitem>
+      <para>
+        Request compression of libpq traffic. The client sends the server a list of compression algorithms supported by the client library.
+        If the server supports one of these algorithms, it acknowledges the use of this algorithm, and then all libpq messages sent both from client to server and
+        vice versa will be compressed. If the server does not support any of the suggested algorithms, it replies with an 'n' (no compression)
+        message and it is up to the client whether to continue work without compression or report error.
+        Supported compression algorithms are chosen at configure time. Right now two libraries are supported: zlib (default) and zstd (if Postgres was
+        configured with the --with-zstd option). In both cases streaming mode is used.
+      </para>
+      </listitem>
+     </varlistentry>
+

- there should be a reference to potential security impacts when used in
   combination with encrypted connections
Done

- What does " and it is up to the client whether to continue work
   without compression or report error" actually mean for a libpq parameter?
This cannot happen.
The client requests compression from the server only if "compression=XXX" was specified in the connection string, and XXX must be supported by the client library, otherwise the request is rejected on the client side.
So the list of supported algorithms sent by the client can never be empty.

- What is the point of the "streaming mode" reference?

There are two ways of performing compression:
- block mode, where each block is compressed individually (the compressor stores a dictionary in each compressed block)
- stream mode
Block mode allows each block to be decompressed independently. It is good for implementing page or field compression (as pglz is used to compress TOAST values),
but it is not efficient for compressing client-server protocol messages.
It seems important to me to explain that libpq uses stream mode, and why there is no pglz compressor.
Why are compression methods identified by one byte identifiers? That
seems unnecessarily small, given this is commonly a once-per-connection
action?

It is mostly for simplicity of implementation: it is always simpler to work with fixed-size messages (or with an array of chars rather than an array of strings). And I do not think that it decreases flexibility in any way: these one-letter algorithm codes are not visible to the user. And I do not think that we will ever support more than 127 (or even 64) different compression algorithms.
The protocol sounds to me like there's no way to enable/disable
compression in an existing connection. To me it seems better to have an
explicit, client initiated, request to use a specific method of
compression (including none). That allows to enable compression for bulk
work, and disable it in other cases (e.g. for security sensitive
content, or for unlikely to compress well content).

It would significantly complicate the implementation (because of buffering at different levels). Also it is not clear to me who would control enabling/disabling compression in this case, and how. I can imagine that "\copy" should trigger compression, but what about a server-side "copy" command? Or just a select returning huge results? I do not think that making the user or application enable/disable compression on the fly is really a good idea.
The overhead of compressing small commands is not so large.
And concerning security risks... In most cases the problem is not relevant at all, because both client and server are located within a single trusted network. If security of the communication really matters, you should not switch on compression at all (including for COPY and other bulk data transfers). It seems a very strange idea to let the client decide which data is "security sensitive" and which is not.

I think that would also make cross-version handling easier, because a
newer client driver can send the compression request and handle the
error, without needing to reconnect or such.

Most importantly, I think such a design is basically a necessity to make
connection poolers to work in a sensible way.

I do not completely understand the problem with connection poolers.
Right now the developers of Yandex Odyssey are trying to support libpq compression in their pooler.
If they run into problems, I will definitely address them.
And lastly, wouldn't it be reasonable to allow to specify things like
compression levels? All that doesn't have to be supported now, but I
think the protocol should take that into account.
Well, if we want to provide maximal flexibility, then we should allow specifying the compression level. Practically, when I implemented our CFS compressed storage for pgpro-ee and libpq_compression, I performed a lot of benchmarks comparing different compression algorithms with different compression levels.
I definitely do not pretend to have done research in this area.
But IMHO the default (fastest) compression level is always the preferable choice: it provides the best compromise between speed and compression ratio. Higher compression levels significantly (several times) reduce compression speed, while their influence on compression ratio is much smaller. Moreover, zstd with the default compression level compresses synthetic data (i.e. strings filled with spaces generated by pgbench) much better
(with a compression ratio of 63!) than with higher compression levels.
Right now in Postgres we do not allow the user to specify the compression level for compressing TOAST data, nor for WAL compression, ... And IMHO for libpq protocol compression, the possibility to specify a compression level is even less useful.

But if you think that it is so important, I will try to implement it. Many questions arise in this case: which side should control the compression level? Should the client affect the compression level both at the client side and at the server side? Or should it be possible to specify the compression level separately for client and server?

+<para>
+        Used compression algorithm. Right now the following streaming 
compression algorithms are supported: 'f' - Facebook zstd, 'z' - zlib, 'n' - no 
compression.
+</para>
I would prefer this just be referenced as zstd or zstandard, not
facebook zstd. There's an RFC (albeit only "informational"), and it
doesn't name facebook, except as an employer:
https://tools.ietf.org/html/rfc8478

Please notice that this is an internal encoding; the user will specify
psql -d "dbname=postgres compression=zstd"

If the name "zstd" is not good, I can choose any other.


+int
+pq_configure(Port* port)
+{
+       char* client_compression_algorithms = port->compression_algorithms;
+       /*
+        * If client request compression, it sends list of supported 
compression algorithms.
+        * Each compression algorirthm is idetified by one letter ('f' - 
Facebook zsts, 'z' - xlib)
+        */
s/algorirthm/algorithm/
s/idetified/identified/
s/zsts/zstd/
s/xlib/zlib/

That's, uh, quite the typo density.

Sorry, fixed


+       if (client_compression_algorithms)
+       {
+               char server_compression_algorithms[ZPQ_MAX_ALGORITHMS];
+               char compression_algorithm = ZPQ_NO_COMPRESSION;
+               char compression[6] = {'z',0,0,0,5,0}; /* message length = 5 */
+               int rc;
Why is this hand-rolling protocol messages?
Sorry, I do not quite understand your concern.
It seems to me that all libpq message manipulation is more or less hand-rolling, isn't it (we are not using protobuf or msgpack)? Or do you think that calling pq_sendbyte, pq_sendint32, ... would be much safer in this case?


+               /* Intersect lists */
+               while (*client_compression_algorithms != '\0')
+               {
+                       if (strchr(server_compression_algorithms, 
*client_compression_algorithms))
+                       {
+                               compression_algorithm = 
*client_compression_algorithms;
+                               break;
+                       }
+                       client_compression_algorithms += 1;
+               }
Why isn't this is handled within zpq?

It seems to be part of the libpq client-server handshake protocol. To me zpq is a lower-level component which is simply told which compressor to use.
+               /* Send 'z' message to the client with selectde comression 
algorithm ('n' if match is ont found) */
s/selectde/selected/
s/comression/compression/
s/ont/not/


:(
Fixed.
But it looks like you are inspecting not the latest patch (libpq_compression-20.patch),
because two of these three typos I have already fixed.

+               socket_set_nonblocking(false);
+               while ((rc = secure_write(MyProcPort, compression, 
sizeof(compression))) < 0
+                          && errno == EINTR);
+               if ((size_t)rc != sizeof(compression))
+                       return -1;
Huh? This all seems like an abstraction violation.


+               /* initialize compression */
+               if (zpq_set_algorithm(compression_algorithm))
+                       PqStream = zpq_create((zpq_tx_func)secure_write, 
(zpq_rx_func)secure_read, MyProcPort);
+       }
+       return 0;
+}
Why is zpq a wrapper around secure_write/read? I'm a bit worried this
will reduce the other places we could use zpq.
zpq has to read/write data from the underlying stream,
and it should be usable both in client and server environments.
I did not see another way to provide a single zpq implementation without code duplication
except passing rx/tx functions to it.



  static int
-pq_recvbuf(void)
+pq_recvbuf(bool nowait)
  {
+               /* If srteaming compression is enabled then use correpondent 
comression read function. */
s/srteaming/streaming/
s/correpondent/corresponding/
s/comression/compression/

Could you please try to proof-read the patch a bit? The typo density
is quite high.
Once again, sorry.
Will do.


+               r = PqStream
+                       ? zpq_read(PqStream, PqRecvBuffer + PqRecvLength,
+                                          PQ_RECV_BUFFER_SIZE - PqRecvLength, 
&processed)
+                       : secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
+                                                 PQ_RECV_BUFFER_SIZE - 
PqRecvLength);
+               PqRecvLength += processed;
? : doesn't make sense to me in this case. This should be an if/else.

Isn't it a matter of style preference?
Why is if/else in principle better than ?: ?
I agree that sometimes ?: leads to more complex and obscure expressions.
But do you think that if/else in this case would lead to clearer or more readable code?

Another question is whether a conditional expression here is really a good idea.
I would prefer to replace it with an indirect function call...
But there are several reasons which make me prefer such a straightforward and not nice way
(at least the difference in function signatures).



                if (r < 0)
                {
+                       if (r == ZPQ_DECOMPRESS_ERROR)
+                       {
+                               char const* msg = zpq_error(PqStream);
+                               if (msg == NULL)
+                                       msg = "end of stream";
+                               ereport(COMMERROR,
+                                               (errcode_for_socket_access(),
+                                                errmsg("failed to decompress data: 
%s", msg)));
+                               return EOF;
+                       }
I don't think we should error out with "failed to decompress data:"
e.g. when the client closed the connection.

Sorry, but this error is reported only when ZPQ_DECOMPRESS_ERROR is returned. It means that the received data cannot be decompressed, not that the client connection is broken.



@@ -1413,13 +1457,18 @@ internal_flush(void)
        char       *bufptr = PqSendBuffer + PqSendStart;
        char       *bufend = PqSendBuffer + PqSendPointer;
- while (bufptr < bufend)
+       while (bufptr < bufend || zpq_buffered(PqStream) != 0) /* has more data 
to flush or unsent data in internal compression buffer */
        {
Overly long line.
Fixed
This should at least specify how these functions are supposed to handle
blocking/nonblocking sockets.

Blocking/nonblocking control is done by the upper layer.
The zpq functions just call the underlying IO functions and do not care whether those calls are blocking or nonblocking.

+
+#define ZSTD_BUFFER_SIZE (8*1024)
+#define ZSTD_COMPRESSION_LEVEL 1
Add some arguments for choosing these parameters.

What is the suggested way to specify them?
I can not put them in GUCs (because they are also needed at the client side).
It may be possible for the client to specify them in the connection string:
psql -d "compression='zstd/level=10/buffer=8k'"
but it seems to be awful and overkill, doesn't it?

+
+/*
+ * Array with all supported compression algorithms.
+ */
+static ZpqAlgorithm const zpq_algorithms[] =
+{
+#if HAVE_LIBZSTD
+       {zstd_name, zstd_create, zstd_read, zstd_write, zstd_free, zstd_error, 
zstd_buffered},
+#endif
+#if HAVE_LIBZ
+       {zlib_name, zlib_create, zlib_read, zlib_write, zlib_free, zlib_error, 
zlib_buffered},
+#endif
+       {NULL}
+};
I think it's preferable to use designated initializers.

Do we really need zero terminated lists? Works fine, but brrr.

Once again, a matter of taste :)
Zero-terminated lists are standard C practice IMHO, not invented by me and widely used in the Postgres code ;)


+/*
+ * Index of used compression algorithm in zpq_algorithms array.
+ */
+static int zpq_algorithm_impl;
This is just odd API design imo. Why doesn't the dispatch work based on
an argument for zpq_create() and the ZpqStream * for the rest?

What if there's two libpq connections in one process? To servers
supporting different compression algorithms? This isn't going to fly.

Fixed.


+/*
+ * Get list of the supported algorithms.
+ * Each algorithm is identified by one letter: 'f' - Facebook zstd, 'z' - zlib.
+ * Algorithm identifies are appended to the provided buffer and terminated by 
'\0'.
+ */
+void
+zpq_get_supported_algorithms(char algorithms[ZPQ_MAX_ALGORITHMS])
+{
+       int i;
+       for (i = 0; zpq_algorithms[i].name != NULL; i++)
+       {
+               Assert(i < ZPQ_MAX_ALGORITHMS);
+               algorithms[i] = zpq_algorithms[i].name();
+       }
+       Assert(i < ZPQ_MAX_ALGORITHMS);
+       algorithms[i] = '\0';
+}
Uh, doesn't this bake ZPQ_MAX_ALGORITHMS into the ABI? That seems
entirely unnecessary?

I tried to avoid the use of dynamic memory allocation because zpq is used both in client and server environments, with different memory allocation policies.

@@ -2180,6 +2257,20 @@ build_startup_packet(const PGconn *conn, char *packet,
                ADD_STARTUP_OPTION("replication", conn->replication);
        if (conn->pgoptions && conn->pgoptions[0])
                ADD_STARTUP_OPTION("options", conn->pgoptions);
+       if (conn->compression && conn->compression[0])
+       {
+               bool enabled;
+               /*
+                * If compressoin is enabled, then send to the server list of 
compression algorithms
+                * supported by client
+                */
s/compressoin/compression/

Fixed
+               if (parse_bool(conn->compression, &enabled))
+               {
+                       char compression_algorithms[ZPQ_MAX_ALGORITHMS];
+                       zpq_get_supported_algorithms(compression_algorithms);
+                       ADD_STARTUP_OPTION("compression", 
compression_algorithms);
+               }
+       }
I think this needs to work in a graceful manner across server
versions. You can make that work with an argument, using the _pq_
parameter stuff, but as I said earlier, I think it's a mistake to deal
with this in the startup packet anyway.

Sorry, I think that it should be quite easy for the user to toggle compression.
Originally I suggested adding a -z option to psql and other Postgres utilities working with the libpq protocol. Adding new options was considered a bad idea by the reviewer, so I left the corresponding option in the connection string:

psql -d "dbname=postgres compression=zlib"

It is IMHO less convenient than "psql -z postgres".
And I am afraid that using the _pq_ parameter stuff makes enabling compression even less user friendly. Still, I can switch to the _pq_ convention if there is consensus that adding "compression" to the startup packet should be avoided.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


