Jorge Leitão created ARROW-15687:
------------------------------------

             Summary: [Format] Clarify that 8 byte padding must not be applied 
to compressed buffers
                 Key: ARROW-15687
                 URL: https://issues.apache.org/jira/browse/ARROW-15687
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Format
            Reporter: Jorge Leitão


I could not find where this is discussed, but I think we do not mention that 
8-byte padding must not be applied to a compressed buffer, since padding 
causes us to lose the size of the compressed data.

For example,

```
import pyarrow.ipc

data = [
    pyarrow.array([1, 2, 3, 4, 5], type="int32"),
]

batch = pyarrow.record_batch(data, names=['f0'])

with pyarrow.OSFile('test1.arrow', 'wb') as sink:
    with pyarrow.ipc.new_file(
        sink,
        batch.schema,
        options=pyarrow.ipc.IpcWriteOptions(compression="zstd"),
    ) as writer:
        writer.write(batch)
```

writes a single data buffer containing

```
[20, 0, 0, 0, 0, 0, 0, 0, 40, 181, 47, 253, 32, 20, 161, 0, 0, 1, 0, 0, 0, 2, 
0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0]
```
which is 37 bytes long (padding to a multiple of 8 would require 40 bytes).
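
To make the layout concrete, here is a minimal sketch that splits the buffer 
quoted above into its parts using only the standard library. It assumes the 
first 8 bytes are a little-endian int64 holding the uncompressed length and 
that the rest is the zstd frame; the per-byte annotations are my reading of 
the bytes, not something the writer reports.

```python
import struct

# The data buffer quoted above, byte for byte.
buffer = bytes([
    20, 0, 0, 0, 0, 0, 0, 0,              # assumed: int64 LE uncompressed length
    40, 181, 47, 253, 32, 20, 161, 0, 0,  # assumed: zstd magic, frame and block headers
    1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0,  # the five int32 values
])

PREFIX_SIZE = 8  # size of the length prefix in bytes

# Read the prefix and separate it from the compressed payload.
(uncompressed_length,) = struct.unpack_from("<q", buffer, 0)
compressed = buffer[PREFIX_SIZE:]

# Length the buffer would have if it were rounded up to a multiple of 8.
padded_length = (len(buffer) + 7) // 8 * 8

print(len(buffer), uncompressed_length, len(compressed), padded_length)
```

The prefix says 20 bytes (five int32 values), the payload after the prefix is 
29 bytes, and rounding the 37-byte total up to a multiple of 8 gives 40.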

My understanding is that we do not pad because doing so would make us unable 
to recover the original size of the (compressed) data, and padding offers no 
advantage since users cannot mmap compressed data anyway.
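
A small sketch of that reasoning, under the assumption that a reader derives 
the compressed size as the buffer length minus the 8-byte prefix. The helper 
names `pad_to_8` and `inferred_compressed_size` are hypothetical and only 
illustrate the arithmetic; they are not part of any Arrow API.

```python
PREFIX_SIZE = 8  # assumed size of the uncompressed-length prefix


def pad_to_8(length: int) -> int:
    """Round a length up to the next multiple of 8."""
    return (length + 7) // 8 * 8


def inferred_compressed_size(buffer_length: int) -> int:
    """Compressed size a reader would infer from the buffer length alone."""
    return buffer_length - PREFIX_SIZE


# With the 37-byte buffer from the example, the reader infers the true size.
actual = inferred_compressed_size(37)

# If the buffer length were padded to 40, three padding bytes would be
# counted as part of the compressed data.
after_padding = inferred_compressed_size(pad_to_8(37))

print(actual, after_padding)
```

Under this assumption the inferred size grows from 29 to 32 bytes once the 
length is padded, and nothing in the buffer says which trailing bytes are 
padding.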




--
This message was sent by Atlassian Jira
(v8.20.1#820001)
