alamb commented on issue #5530:
URL: https://github.com/apache/arrow-rs/issues/5530#issuecomment-2051688234

    @mapleFU brought up a good point on 
https://github.com/apache/arrow-rs/pull/5557#issuecomment-2046484889 that I 
wanted to record here. The observation is that the special handling for 
StringView / BinaryView will be substantially more code in the parquet decoder. 
   
   The reason for adding the special case is to avoid a copy.
   
   For example, as I understand it from the [parquet encodings 
doc](https://parquet.apache.org/docs/file-format/data-pages/encodings/):
   
   ```
   ...
   BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
in the array
   ...
   ```
   
   So the data looks like this (length prefix, followed by the bytes):
   ```
   \3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
   ```
   
   To make a StringArray, those bytes must be copied to a new buffer so they 
are contiguous:
   ```
   offsets: [0, 3, 29]
   data: fooabcdefghijklmnopqrstuvwxyz
   ```
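
As a rough illustration of the copy described above, here is a minimal sketch in plain Rust (standard library only). `decode_to_offsets` is a hypothetical helper written for this comment, not an arrow-rs or parquet API, and the real decoder is structured differently. It assumes the page holds only PLAIN-encoded BYTE_ARRAY values.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not an arrow-rs API): walk PLAIN-encoded BYTE_ARRAY
/// values (4-byte little-endian length, then the bytes) and build
/// StringArray-style offsets plus one contiguous data buffer.
/// Returns None if the input is truncated or too large for i32 offsets.
fn decode_to_offsets(page: &[u8]) -> Option<(Vec<i32>, Vec<u8>)> {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    let mut pos = 0usize;
    while pos < page.len() {
        // Read the length prefix.
        let prefix = <[u8; 4]>::try_from(page.get(pos..pos + 4)?).ok()?;
        let len = u32::from_le_bytes(prefix) as usize;
        pos += 4;
        // This is the copy: the value bytes are moved into a new buffer
        // so that consecutive values are contiguous.
        let value = page.get(pos..pos.checked_add(len)?)?;
        data.extend_from_slice(value);
        offsets.push(i32::try_from(data.len()).ok()?);
        pos += len;
    }
    Some((offsets, data))
}

fn main() {
    // Build the example page: "foo" followed by the 26-letter alphabet.
    let mut page = Vec::new();
    for s in &["foo", "abcdefghijklmnopqrstuvwxyz"] {
        page.extend_from_slice(&(s.len() as u32).to_le_bytes());
        page.extend_from_slice(s.as_bytes());
    }
    let (offsets, data) = decode_to_offsets(&page).expect("well-formed page");
    println!("offsets: {:?}", offsets); // offsets: [0, 3, 29]
    println!("data: {}", String::from_utf8_lossy(&data));
}
```

Every value byte passes through `extend_from_slice`, which is the cost the StringView path is meant to avoid.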
   
   However, for a StringView array, the raw bytes can be used without copying:
   
   ```
   views: [(len: 3, data: "foo"), (len: 26, offset: 11)]
   \3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
   ```
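
To make the view idea concrete, here is a minimal sketch in plain Rust (standard library only). The `View` enum and `decode_to_views` are hypothetical names used for readability. Arrow actually packs each view into 16 bytes, with values of 12 bytes or fewer stored inline and longer values stored as a 4-byte prefix, a buffer index, and an offset. The sketch assumes the whole page is kept alive as data buffer 0.

```rust
use std::convert::TryFrom;

/// Readable stand-in for Arrow's packed 16-byte view (hypothetical type).
#[derive(Debug, PartialEq)]
enum View {
    /// Values of at most 12 bytes live inside the view itself.
    Inline { len: u32, data: [u8; 12] },
    /// Longer values point into a data buffer; here, the raw page.
    Buffer { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Hypothetical helper: build views over PLAIN-encoded BYTE_ARRAY data
/// without copying the long values out of `page`.
fn decode_to_views(page: &[u8]) -> Option<Vec<View>> {
    let mut views = Vec::new();
    let mut pos = 0usize;
    while pos < page.len() {
        let len_bytes = <[u8; 4]>::try_from(page.get(pos..pos + 4)?).ok()?;
        let len = u32::from_le_bytes(len_bytes);
        pos += 4;
        let value = page.get(pos..pos.checked_add(len as usize)?)?;
        if value.len() <= 12 {
            // Short value: copy at most 12 bytes into the view.
            let mut data = [0u8; 12];
            data[..value.len()].copy_from_slice(value);
            views.push(View::Inline { len, data });
        } else {
            // Long value: record where it already sits in the page.
            let mut prefix = [0u8; 4];
            prefix.copy_from_slice(&value[..4]);
            let offset = u32::try_from(pos).ok()?;
            views.push(View::Buffer { len, prefix, buffer_index: 0, offset });
        }
        pos += value.len();
    }
    Some(views)
}

fn main() {
    let mut page = Vec::new();
    for s in &["foo", "abcdefghijklmnopqrstuvwxyz"] {
        page.extend_from_slice(&(s.len() as u32).to_le_bytes());
        page.extend_from_slice(s.as_bytes());
    }
    let views = decode_to_views(&page).expect("well-formed page");
    println!("{:?}", views);
}
```

The length prefixes stay in the buffer as unused gaps between values, which is fine because each view carries its own offset and length.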
   


