Hi,
Thank you for the review.

On 28.10.2020 22:27, Andres Freund wrote:
I don't see a corresponding configure.ac change?
Shame on me - I completely forgot that configure is actually generated from configure.ac.
Fixed.

+     <varlistentry id="libpq-connect-compression" xreflabel="compression">
+      <term><literal>compression</literal></term>
+      <listitem>
+      <para>
+        Request compression of libpq traffic. The client sends the server a list of compression algorithms supported by the client library.
+        If the server supports one of these algorithms, it acknowledges the use of this algorithm, and then all libpq messages sent both from client to server and
+        vice versa will be compressed. If the server does not support any of the suggested algorithms, it replies with an 'n' (no compression)
+        message and it is up to the client whether to continue work without compression or report error.
+        Supported compression algorithms are chosen at configure time. Right now two libraries are supported: zlib (default) and zstd (if Postgres was
+        configured with the --with-zstd option). In both cases streaming mode is used.
+      </para>
+      </listitem>
+     </varlistentry>
+

- there should be a reference to potential security impacts when used in
   combination with encrypted connections
Done

- What does " and it is up to the client whether to continue work
   without compression or report error" actually mean for a libpq parameter?
This cannot happen.
The client requests compression from the server only if "compression=XXX" was specified in the connection string, and XXX must be supported by the client library, otherwise the request is rejected on the client side.
So the list of supported algorithms sent by the client can never be empty.

- What is the point of the "streaming mode" reference?

There are two ways of performing compression:
- block mode, where each block is compressed individually (the compressor stores a dictionary in each compressed block)
- stream mode
Block mode allows each block to be decompressed independently. It is good for implementing page or field compression (as pglz is used to compress TOAST values),
but it is not efficient for compressing client-server protocol messages.
It seems important to me to explain that libpq uses stream mode, and why there is no pglz compressor.
Why are compression methods identified by one byte identifiers? That
seems unnecessarily small, given this is commonly a once-per-connection
action?

It is mostly for simplicity of implementation: it is always simpler to work with fixed-size messages (or with an array of chars rather than an array of strings). And I do not think that it decreases flexibility in any way: these one-letter algorithm codes are not visible to the user. And I do not think that we will ever support more than 127 (or even 64) different compression algorithms.
The protocol sounds to me like there's no way to enable/disable
compression in an existing connection. To me it seems better to have an
explicit, client initiated, request to use a specific method of
compression (including none). That allows to enable compression for bulk
work, and disable it in other cases (e.g. for security sensitive
content, or for unlikely to compress well content).

It would significantly complicate the implementation (because of buffering at different levels). Also it is not clear to me who would control enabling/disabling compression in this case, and how. I can imagine that "\copy" should trigger compression, but what about a server-side "copy" command? Or just a select returning huge results? I do not think that making the user or application enable/disable compression on the fly is really a good idea.
The overhead of compressing small commands is not so large.
And concerning security risks... In most cases the problem is not relevant at all, because both client and server are located within a single trusted network. If security of the communication really matters, you should not switch on compression at all (including for COPY and other bulk data transfers). It seems a very strange idea to let the client decide which data is "security sensitive" and which is not.

I think that would also make cross-version handling easier, because a
newer client driver can send the compression request and handle the
error, without needing to reconnect or such.

Most importantly, I think such a design is basically a necessity to make
connection poolers to work in a sensible way.

I do not completely understand the problem with connection poolers.
Right now the developers of Yandex Odyssey are trying to support libpq compression in their pooler.
If they run into problems, I will definitely address them.
And lastly, wouldn't it be reasonable to allow to specify things like
compression levels? All that doesn't have to be supported now, but I
think the protocol should take that into account.
Well, if we want to provide maximal flexibility, then we should allow specifying the compression level. Practically, when I implemented our CFS compressed storage for pgpro-ee and libpq_compression, I performed a lot of benchmarks comparing different compression algorithms with different compression levels.
I definitely do not pretend to have done research in this area.
But IMHO the default (fastest) compression level is always the preferable choice: it provides the best compromise between speed and compression ratio. Higher compression levels significantly (several times) reduce compression speed, while their influence on compression ratio is much smaller. Moreover, zstd with the default compression level compresses synthetic data (i.e. strings filled with spaces generated by pgbench) much better
(with a compression ratio of 63!) than with higher compression levels.
Right now in Postgres we do not allow the user to specify the compression level for compressing TOAST data, nor for WAL compression, ... And IMHO for libpq protocol compression, the possibility to specify a compression level is even less useful.

But if you think that it is so important, I will try to implement it. Many questions arise in this case: which side should control the compression level? Should the client affect the compression level both at the client side and at the server side? Or should it be possible to specify the compression level separately for client and server?

+<para>
+        Used compression algorithm. Right now the following streaming 
compression algorithms are supported: 'f' - Facebook zstd, 'z' - zlib, 'n' - no 
compression.
+</para>
I would prefer this just be referenced as zstd or zstandard, not
facebook zstd. There's an RFC (albeit only "informational"), and it
doesn't name facebook, except as an employer:
https://tools.ietf.org/html/rfc8478

Please notice that this is an internal encoding; the user will specify
psql -d "dbname=postgres compression=zstd"

If the name "zstd" is not good, I can choose any other.


+int
+pq_configure(Port* port)
+{
+       char* client_compression_algorithms = port->compression_algorithms;
+       /*
+        * If client request compression, it sends list of supported 
compression algorithms.
+        * Each compression algorirthm is idetified by one letter ('f' - 
Facebook zsts, 'z' - xlib)
+        */
s/algorirthm/algorithm/
s/idetified/identified/
s/zsts/zstd/
s/xlib/zlib/

That's, uh, quite the typo density.

Sorry, fixed


+       if (client_compression_algorithms)
+       {
+               char server_compression_algorithms[ZPQ_MAX_ALGORITHMS];
+               char compression_algorithm = ZPQ_NO_COMPRESSION;
+               char compression[6] = {'z',0,0,0,5,0}; /* message length = 5 */
+               int rc;
Why is this hand-rolling protocol messages?
Sorry, I do not quite understand your concern.
It seems to me that all libpq message manipulation is more or less hand-rolling, isn't it (we are not using protobuf or msgpack)? Or do you think that calling pq_sendbyte, pq_sendint32, ... would be much safer in this case?


+               /* Intersect lists */
+               while (*client_compression_algorithms != '\0')
+               {
+                       if (strchr(server_compression_algorithms, 
*client_compression_algorithms))
+                       {
+                               compression_algorithm = 
*client_compression_algorithms;
+                               break;
+                       }
+                       client_compression_algorithms += 1;
+               }
Why isn't this is handled within zpq?

It seems to be part of the libpq client-server handshake protocol. To me zpq is a lower-level component which is simply told which compressor to use.
+               /* Send 'z' message to the client with selectde comression 
algorithm ('n' if match is ont found) */
s/selectde/selected/
s/comression/compression/
s/ont/not/


:(
Fixed.
But it looks like you are inspecting not the latest patch (libpq_compression-20.patch),
because two of these three typos I have already fixed.

+               socket_set_nonblocking(false);
+               while ((rc = secure_write(MyProcPort, compression, 
sizeof(compression))) < 0
+                          && errno == EINTR);
+               if ((size_t)rc != sizeof(compression))
+                       return -1;
Huh? This all seems like an abstraction violation.


+               /* initialize compression */
+               if (zpq_set_algorithm(compression_algorithm))
+                       PqStream = zpq_create((zpq_tx_func)secure_write, 
(zpq_rx_func)secure_read, MyProcPort);
+       }
+       return 0;
+}
Why is zpq a wrapper around secure_write/read? I'm a bit worried this
will reduce the other places we could use zpq.
zpq has to read/write data from the underlying stream,
and it should be usable both in client and server environments.
I did not see another way to provide a single zpq implementation without code duplication
except passing rx/tx functions to it.



  static int
-pq_recvbuf(void)
+pq_recvbuf(bool nowait)
  {
+               /* If srteaming compression is enabled then use correpondent 
comression read function. */
s/srteaming/streaming/
s/correpondent/corresponding/
s/comression/compression/

Could you please try to proof-read the patch a bit? The typo density
is quite high.
Once again, sorry.
Will do.


+               r = PqStream
+                       ? zpq_read(PqStream, PqRecvBuffer + PqRecvLength,
+                                          PQ_RECV_BUFFER_SIZE - PqRecvLength, 
&processed)
+                       : secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
+                                                 PQ_RECV_BUFFER_SIZE - 
PqRecvLength);
+               PqRecvLength += processed;
? : doesn't make sense to me in this case. This should be an if/else.

Isn't it a matter of style preference?
Why is if/else in principle better than ?: ?
I agree that sometimes ?: leads to more complex and obscure expressions.
But do you think that if/else in this case would lead to clearer or more readable code?

Another question is whether a conditional expression here is really a good idea.
I would prefer to replace it with an indirect function call...
But there are several reasons which make me prefer such a straightforward and not nice way
(at least the difference in function signatures).



                if (r < 0)
                {
+                       if (r == ZPQ_DECOMPRESS_ERROR)
+                       {
+                               char const* msg = zpq_error(PqStream);
+                               if (msg == NULL)
+                                       msg = "end of stream";
+                               ereport(COMMERROR,
+                                               (errcode_for_socket_access(),
+                                                errmsg("failed to decompress data: 
%s", msg)));
+                               return EOF;
+                       }
I don't think we should error out with "failed to decompress data:"
e.g. when the client closed the connection.

Sorry, but this error is reported only when ZPQ_DECOMPRESS_ERROR is returned. It means that the received data cannot be decompressed, not that the client connection is broken.



@@ -1413,13 +1457,18 @@ internal_flush(void)
        char       *bufptr = PqSendBuffer + PqSendStart;
        char       *bufend = PqSendBuffer + PqSendPointer;
- while (bufptr < bufend)
+       while (bufptr < bufend || zpq_buffered(PqStream) != 0) /* has more data 
to flush or unsent data in internal compression buffer */
        {
Overly long line.
Fixed
This should at least specify how these functions are supposed to handle
blocking/nonblocking sockets.

Blocking/nonblocking control is done by the upper layer.
The zpq functions just call the underlying IO functions and do not care whether those calls are blocking or nonblocking.

+
+#define ZSTD_BUFFER_SIZE (8*1024)
+#define ZSTD_COMPRESSION_LEVEL 1
Add some arguments for choosing these parameters.

What is the suggested way to specify them?
I can not put them in GUCs (because they are also needed at the client side).
It may be possible for the client to specify them in the connection string:
psql -d "compression='zstd/level=10/buffer=8k'"
but it seems to be awful and overkill, doesn't it?

+
+/*
+ * Array with all supported compression algorithms.
+ */
+static ZpqAlgorithm const zpq_algorithms[] =
+{
+#if HAVE_LIBZSTD
+       {zstd_name, zstd_create, zstd_read, zstd_write, zstd_free, zstd_error, 
zstd_buffered},
+#endif
+#if HAVE_LIBZ
+       {zlib_name, zlib_create, zlib_read, zlib_write, zlib_free, zlib_error, 
zlib_buffered},
+#endif
+       {NULL}
+};
I think it's preferable to use designated initializers.

Do we really need zero terminated lists? Works fine, but brrr.

Once again, a matter of taste :)
Zero-terminated lists are standard C practice IMHO, not invented by me and widely used in the Postgres code ;)


+/*
+ * Index of used compression algorithm in zpq_algorithms array.
+ */
+static int zpq_algorithm_impl;
This is just odd API design imo. Why doesn't the dispatch work based on
an argument for zpq_create() and the ZpqStream * for the rest?

What if there's two libpq connections in one process? To servers
supporting different compression algorithms? This isn't going to fly.

Fixed.


+/*
+ * Get list of the supported algorithms.
+ * Each algorithm is identified by one letter: 'f' - Facebook zstd, 'z' - zlib.
+ * Algorithm identifies are appended to the provided buffer and terminated by 
'\0'.
+ */
+void
+zpq_get_supported_algorithms(char algorithms[ZPQ_MAX_ALGORITHMS])
+{
+       int i;
+       for (i = 0; zpq_algorithms[i].name != NULL; i++)
+       {
+               Assert(i < ZPQ_MAX_ALGORITHMS);
+               algorithms[i] = zpq_algorithms[i].name();
+       }
+       Assert(i < ZPQ_MAX_ALGORITHMS);
+       algorithms[i] = '\0';
+}
Uh, doesn't this bake ZPQ_MAX_ALGORITHMS into the ABI? That seems
entirely unnecessary?

I tried to avoid the use of dynamic memory allocation because zpq is used both in client and server environments, with different memory allocation policies.

@@ -2180,6 +2257,20 @@ build_startup_packet(const PGconn *conn, char *packet,
                ADD_STARTUP_OPTION("replication", conn->replication);
        if (conn->pgoptions && conn->pgoptions[0])
                ADD_STARTUP_OPTION("options", conn->pgoptions);
+       if (conn->compression && conn->compression[0])
+       {
+               bool enabled;
+               /*
+                * If compressoin is enabled, then send to the server list of 
compression algorithms
+                * supported by client
+                */
s/compressoin/compression/

Fixed
+               if (parse_bool(conn->compression, &enabled))
+               {
+                       char compression_algorithms[ZPQ_MAX_ALGORITHMS];
+                       zpq_get_supported_algorithms(compression_algorithms);
+                       ADD_STARTUP_OPTION("compression", 
compression_algorithms);
+               }
+       }
I think this needs to work in a graceful manner across server
versions. You can make that work with an argument, using the _pq_
parameter stuff, but as I said earlier, I think it's a mistake to deal
with this in the startup packet anyway.

Sorry, I think that it should be quite easy for the user to toggle compression.
Originally I suggested adding a -z option to psql and other Postgres utilities working with the libpq protocol. Adding new options was considered a bad idea by the reviewer, so I left the corresponding option in the connection string:

psql -d "dbname=postgres compression=zlib"

It is IMHO less convenient than "psql -z postgres".
And I am afraid that using the _pq_ parameter stuff makes enabling compression even less user friendly. Still, I can switch to the _pq_ convention if there is consensus that adding "compression" to the startup packet should be avoided.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


