alamb commented on PR #5557: URL: https://github.com/apache/arrow-rs/pull/5557#issuecomment-2051686065
> From the talk here ( https://github.com/apache/arrow-rs/pull/5618#issuecomment-2045839896 ) Views type would keeps views schema. However, I think write by StringView/BinaryView doesn't means a read type is BinaryView/StringView. Should we just use Binary/String?
>
> Otherwise, we need extra checking for handling written-by view and read by string. If StringView/BinaryView is keep in metadata, hope a parquet-testing file for this could be add to testing legacy parquet reader could reading this to string/binary without casting

I agree with @mapleFU's observation that the special handling for StringView / BinaryView will be substantially more code. The reason it would be valuable is that I think it can save a copy.

For example, as I understand it, from the [parquet encodings doc](https://parquet.apache.org/docs/file-format/data-pages/encodings/):

```
...
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
...
```

So the data looks like this (length prefix, followed by the bytes):

```
\3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
```

To make a StringArray, those bytes must be copied to a new buffer so they are contiguous:

```
offsets: [0, 3, 29]
data: fooabcdefghijklmnopqrstuvwxyz
```

However, for a StringView array, the raw bytes can be used without copying; each view only needs to record where its value sits in the original buffer:

```
views: [(len: 3, data: "foo"), (len: 26, offset: 11)]
\3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
```
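To make the copy vs. no-copy difference concrete, here is a minimal, standard-library-only sketch in Rust. It is not the arrow-rs or parquet reader implementation: the `View` struct and the functions `value_ranges`, `decode_contiguous`, `decode_views`, and `example_page` are hypothetical names invented for illustration. It also simplifies the real Arrow view layout, which packs each view into 16 bytes and inlines short strings; here every view is just a `(len, offset)` pair pointing into the page buffer.

```rust
/// Simplified view: length plus offset into the original page buffer.
/// Hypothetical type; the real Arrow view is a packed 16-byte value.
#[derive(Debug, PartialEq)]
pub struct View {
    pub len: u32,
    pub offset: u32,
}

/// Walk a PLAIN-encoded BYTE_ARRAY buffer and return the (start, len)
/// of each value's payload, skipping the 4-byte little-endian prefixes.
fn value_ranges(page: &[u8]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut pos = 0;
    while pos + 4 <= page.len() {
        let prefix = [page[pos], page[pos + 1], page[pos + 2], page[pos + 3]];
        let len = u32::from_le_bytes(prefix) as usize;
        let start = pos + 4;
        assert!(start + len <= page.len(), "truncated value");
        ranges.push((start, len));
        pos = start + len;
    }
    ranges
}

/// StringArray-style decode: payloads are copied into one contiguous
/// data buffer, with offsets marking the value boundaries.
pub fn decode_contiguous(page: &[u8]) -> (Vec<i32>, Vec<u8>) {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    for (start, len) in value_ranges(page) {
        data.extend_from_slice(&page[start..start + len]); // the copy
        offsets.push(data.len() as i32);
    }
    (offsets, data)
}

/// StringView-style decode: no payload bytes are copied; each view
/// records where its value lives in the original page buffer.
pub fn decode_views(page: &[u8]) -> Vec<View> {
    value_ranges(page)
        .into_iter()
        .map(|(start, len)| View { len: len as u32, offset: start as u32 })
        .collect()
}

/// Build the example page from the comment: "foo" followed by the
/// 26-letter alphabet, each with a 4-byte little-endian length prefix.
pub fn example_page() -> Vec<u8> {
    let mut page = Vec::new();
    for s in ["foo", "abcdefghijklmnopqrstuvwxyz"] {
        page.extend_from_slice(&(s.len() as u32).to_le_bytes());
        page.extend_from_slice(s.as_bytes());
    }
    page
}
```

Calling `decode_contiguous(&example_page())` allocates a second buffer holding the 29 payload bytes, whereas `decode_views(&example_page())` only produces two small `(len, offset)` records and leaves the page bytes where they are. The saving depends on the page buffer staying alive for as long as the views are in use, which is part of the extra handling discussed above.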
