Re: [PR] GH-37756: [Format][Docs] Document IPC Compression [arrow]

via GitHub Wed, 11 Sep 2024 09:57:28 -0700


jorisvandenbossche commented on code in PR #43950:
URL: https://github.com/apache/arrow/pull/43950#discussion_r1755173036



##########
docs/source/format/Columnar.rst:
##########
@@ -1385,6 +1387,59 @@ have two entries in each RecordBatch. For a RecordBatch of this schema with
     buffer 13: col2    data
 
 
+Compression
+-----------
+
+There are three different options for compression of record batch
+body buffers: Buffers can be uncompressed, buffers can be
+compressed with the ``lz4`` compression codec, or buffers can
+be compressed with the ``zstd`` compression codec. Buffers in
+the flat sequence of a message body must be either all
+uncompressed or all compressed separately using the same codec.
+
+.. note::
+
+   ``lz4`` compression codec means the
+   `LZ4 frame format <https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md>`_
+   and should not be confused with
+   `"raw" (also called "block") format <https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>`_.
+
+The difference between compressed and uncompressed buffers in the
+serialized form is as follows:
+
+* If the buffers in the :ref:`ipc-recordbatch-message` are **compressed**
+
+  - the ``data header`` includes the length and memory offset
+    of each **compressed buffer** in the record batch's body
+
+  - the ``body`` includes a flat sequence of **compressed buffers**
+    together with the **length of uncompressed buffer** as a 64-bit
+    little-endian signed integer stored in the first 8 bytes for each
+    buffer in the sequence
+
+* If the buffers in the :ref:`ipc-recordbatch-message` are **uncompressed**
+
+  - the ``data header`` includes the length and memory offset
+    of each **uncompressed buffer** in the record batch's body
+
+  - the ``body`` includes a flat sequence of **uncompressed buffers**
+    with the first 8 bytes empty or equal to ``-1`` to indicate that
+    the buffer is uncompressed

Review Comment:
   Oh, apologies, I see I made an unfortunate typo in my comment above. My 
first sentence was about the **un**compressed case, not the compressed case (so 
it is in the uncompressed case that there are no leading 8 bytes with the length).
   
   But I think you got it anyway seeing the updated text, which looks good.
   
   And I think it is fine to keep the bullet point with the uncompressed case. 
It might now be a bit "trivial" but I think it is still good to explicitly 
contrast it with the compressed case.
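
   To make the contrast concrete, here is a minimal Python sketch of how one body buffer of a *compressed* message could be written and read, based on my reading of the text under review. The function names are hypothetical, and `zlib` is only a standard-library stand-in for the real `lz4` (frame format) or `zstd` codecs. In an uncompressed message none of this applies, since the buffers carry no 8-byte prefix at all.

```python
import struct
import zlib
from typing import Callable, Optional

PREFIX = struct.Struct("<q")  # 64-bit little-endian signed integer


def encode_body_buffer(data: bytes,
                       compress: Optional[Callable[[bytes], bytes]]) -> bytes:
    """Hypothetical writer for one buffer in a compressed message body."""
    if compress is None:
        # Assumption: -1 marks a buffer stored as-is inside a compressed message.
        return PREFIX.pack(-1) + data
    # Prefix holds the length of the *uncompressed* buffer.
    return PREFIX.pack(len(data)) + compress(data)


def decode_body_buffer(raw: bytes,
                       decompress: Callable[[bytes], bytes]) -> bytes:
    """Hypothetical reader: inspect the 8-byte prefix, then decode the rest."""
    (uncompressed_len,) = PREFIX.unpack_from(raw, 0)
    payload = raw[PREFIX.size:]
    if uncompressed_len == -1:
        return bytes(payload)  # stored without compression
    out = decompress(payload)
    if len(out) != uncompressed_len:
        raise ValueError("uncompressed length does not match the prefix")
    return out


# zlib stands in for lz4/zstd here purely so the sketch runs as-is.
original = b"arrow" * 100
compressed = encode_body_buffer(original, zlib.compress)
stored_raw = encode_body_buffer(original, None)
roundtrip_compressed = decode_body_buffer(compressed, zlib.decompress)
roundtrip_raw = decode_body_buffer(stored_raw, zlib.decompress)
```

   Note that the length and offset in the data header would describe the whole encoded buffer (prefix included), which is why the reader slices the prefix off before decompressing.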
   


