alamb commented on PR #5557:
URL: https://github.com/apache/arrow-rs/pull/5557#issuecomment-2051686065

   > From the discussion here 
(https://github.com/apache/arrow-rs/pull/5618#issuecomment-2045839896), the view 
types would keep the view schema. However, I think writing with 
StringView/BinaryView doesn't mean the read type is BinaryView/StringView. 
Should we just use Binary/String?
   
   > Otherwise, we need extra checks to handle data written as a view and read 
as a string. If StringView/BinaryView is kept in the metadata, I hope a 
parquet-testing file for this can be added, to test that a legacy parquet 
reader can read it as string/binary without casting.
   
   I agree with @mapleFU's observation that special handling for StringView / 
BinaryView will require substantially more code.
   
   The reason it would be valuable is that I think it can save a copy.
   
   For example, as I understand it from the [parquet encodings 
doc](https://parquet.apache.org/docs/file-format/data-pages/encodings/):
   
   ```
   ...
   BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
in the array
   ...
   ```
   
   So the data looks like this (length prefix, followed by the bytes):
   ```
   \3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
   ```
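
As a rough illustration of that layout, here is a minimal Rust sketch that produces it. `encode_plain_byte_array` is a hypothetical helper written for this comment, not a function from the parquet crate, and it only models the length-prefixed value bytes (no page headers, levels, or compression).

```rust
/// Lay out values the way the PLAIN encoding stores BYTE_ARRAY data:
/// a 4-byte little-endian length followed by the value's bytes.
/// Hypothetical helper for illustration only, not a parquet crate API.
pub fn encode_plain_byte_array(values: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        // The length prefix is 32 bits wide, so longer values cannot be encoded.
        assert!(v.len() <= u32::MAX as usize, "value too long for a 4-byte length");
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}
```

Encoding `foo` and the 26-letter alphabet this way gives 37 bytes: 4 + 3 for the first value and 4 + 26 for the second.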
   
   To make a StringArray, those bytes must be copied to a new buffer so they 
are contiguous:
   ```
   offsets: [0, 3, 29]
   data: fooabcdefghijklmnopqrstuvwxyz
   ```
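
To make the copy concrete, here is a minimal sketch of that decode step, assuming a well-formed buffer of length-prefixed values. `decode_to_offsets_and_data` is a hypothetical name, not the actual arrow-rs reader code, and it uses plain `Vec`s rather than Arrow buffers.

```rust
/// Decode length-prefixed BYTE_ARRAY values into the two buffers a
/// StringArray-style layout needs: offsets and contiguous value bytes.
/// Hypothetical helper; assumes `page` is well formed (panics otherwise).
pub fn decode_to_offsets_and_data(page: &[u8]) -> (Vec<i32>, Vec<u8>) {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    let mut pos = 0;
    while pos < page.len() {
        // Read the 4-byte little-endian length prefix.
        let len = u32::from_le_bytes([page[pos], page[pos + 1], page[pos + 2], page[pos + 3]])
            as usize;
        pos += 4;
        // This is the copy: the value bytes are appended to a new buffer so
        // that consecutive values are contiguous, with the prefixes dropped.
        data.extend_from_slice(&page[pos..pos + len]);
        pos += len;
        offsets.push(data.len() as i32);
    }
    (offsets, data)
}
```

Every value byte is written once more into `data`, which is the copy a view-based array could avoid.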
   
   However, for a StringView array, the raw bytes can be used without copying:
   
   ```
   views: [(len: 3, data: "foo"), (len: 26, offset: 11)]
   \3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
   ```
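
A minimal sketch of how such views could be built over the undecoded buffer, again assuming well-formed input. The `View` enum and `build_views` are hypothetical and simplified: they follow the Arrow view layout as I understand it (values of 12 bytes or fewer are stored inline, longer values store a 4-byte prefix plus a buffer index and offset), but they are not the arrow-rs types.

```rust
/// Simplified model of a 16-byte Arrow view (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
pub enum View {
    /// Values of at most 12 bytes live inside the view itself, zero padded.
    Inline { len: u32, data: [u8; 12] },
    /// Longer values point into a data buffer and keep a 4-byte prefix.
    Reference { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Build views over `page` without copying the long values: each long view
/// records where its bytes start inside the original buffer.
pub fn build_views(page: &[u8], buffer_index: u32) -> Vec<View> {
    let mut views = Vec::new();
    let mut pos = 0;
    while pos < page.len() {
        let len = u32::from_le_bytes([page[pos], page[pos + 1], page[pos + 2], page[pos + 3]]);
        pos += 4; // skip the length prefix; `pos` is now the value's offset
        let value = &page[pos..pos + len as usize];
        if len <= 12 {
            let mut data = [0u8; 12];
            data[..value.len()].copy_from_slice(value);
            views.push(View::Inline { len, data });
        } else {
            let mut prefix = [0u8; 4];
            prefix.copy_from_slice(&value[..4]);
            views.push(View::Reference { len, prefix, buffer_index, offset: pos as u32 });
        }
        pos += len as usize;
    }
    views
}
```

For the example buffer, the second view's offset is 11: 4 bytes of length prefix, 3 bytes of `foo`, and another 4 bytes of length prefix precede the alphabet. Only short values are copied (into the view itself); the long value stays where it is.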
   

