alamb commented on issue #5530:
URL: https://github.com/apache/arrow-rs/issues/5530#issuecomment-2051688234
@mapleFU brought up a good point on
https://github.com/apache/arrow-rs/pull/5557#issuecomment-2046484889 that I
wanted to record here. The observation is that the special handling of
StringView / BinaryView will be substantially more code in the parquet decoder.
The reason for adding the special case is to avoid a copy.
For example, as I understand it, from the [parquet encodings
doc](https://parquet.apache.org/docs/file-format/data-pages/encodings/)
```
...
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained
in the array
...
```
So the data looks like this (length prefix, followed by the bytes):
```
\3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
```
To make a StringArray, those bytes must be copied to a new buffer so they
are contiguous:
```
offsets: [0, 3, 29]
data: fooabcdefghijklmnopqrstuvwxyz
```
However, for a StringView array, the raw bytes can be used without copying:
```
views: [(len: 3, data: "foo"), (len: 26, offset: 11)]
\3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
```
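To make the difference concrete, here is a rough std-only Rust sketch of the two strategies. It is not the arrow-rs or parquet decoder code: `value_ranges`, `to_offsets_and_data`, `to_views`, and the `View` enum are hypothetical names made up for illustration, and `View` is a simplification of the real 16-byte Arrow view (which also carries a 4-byte prefix and a buffer index for long strings). The only Arrow detail assumed is that values of 12 bytes or fewer are stored inline in the view.

```rust
use std::ops::Range;

/// Walk PLAIN-encoded BYTE_ARRAY data (4-byte little-endian length, then the
/// bytes) and return the byte range of each value inside `page`.
/// Returns `None` if the data is truncated.
fn value_ranges(page: &[u8]) -> Option<Vec<Range<usize>>> {
    let mut ranges = Vec::new();
    let mut pos = 0;
    while pos < page.len() {
        let b = page.get(pos..pos + 4)?; // length prefix
        let len = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
        let start = pos + 4;
        let end = start.checked_add(len)?;
        if end > page.len() {
            return None;
        }
        ranges.push(start..end);
        pos = end;
    }
    Some(ranges)
}

/// StringArray-style output: every value is copied into one contiguous
/// data buffer, and `offsets` marks the value boundaries.
fn to_offsets_and_data(page: &[u8]) -> Option<(Vec<i32>, Vec<u8>)> {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    for r in value_ranges(page)? {
        data.extend_from_slice(&page[r]); // this is the copy views can avoid
        offsets.push(data.len() as i32);
    }
    Some((offsets, data))
}

/// Simplified stand-in for an Arrow view (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
enum View {
    /// Short value (12 bytes or fewer) stored directly in the view.
    Inline(Vec<u8>),
    /// Long value: a length and an offset into the original page buffer.
    Buffer { len: u32, offset: u32 },
}

/// StringView-style output: long values point back into `page`, so only
/// short values are copied (into the view itself).
fn to_views(page: &[u8]) -> Option<Vec<View>> {
    let views = value_ranges(page)?
        .into_iter()
        .map(|r| {
            if r.len() <= 12 {
                View::Inline(page[r].to_vec())
            } else {
                View::Buffer { len: r.len() as u32, offset: r.start as u32 }
            }
        })
        .collect();
    Some(views)
}
```

In this sketch the offsets path touches every value byte a second time, while the views path only records where the long values already live in the page. The extra decoder code mentioned above would come from having to keep the page buffer alive and build views instead of reusing the existing offset-based path.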