fenfeng9 commented on issue #49740: URL: https://github.com/apache/arrow/issues/49740#issuecomment-4699363013
@andishgar Thanks for looking into this, and no worries at all.

My current understanding is that, for all-inline `BinaryView` / `Utf8View` values, no variadic buffer is actually needed.

In practice, I found that Arrow can currently produce/accept both of these layouts:

```text
[validity, views]
[validity, views, zero-size variadic data buffer]
```

For example, the normal C++ `BinaryViewBuilder::Append()` path produces the first form, while this PyArrow construction path produces the second form:

```python
import pyarrow as pa

arr = pa.array([b"ab", b"cd", b"ef"], type=pa.binary_view())
print([None if b is None else b.size for b in arr.buffers()])
```

```text
[None, 48, 0]
```

The extra zero-size buffer seems to come from the PyArrow conversion path reserving data space before appending the value:

https://github.com/apache/arrow/blob/16fe34250a2ef261790b9cc414fdf0831669cf9f/python/pyarrow/src/arrow/python/python_to_arrow.cc#L770-L775

For short values, the data is still encoded inline, so the reserved heap block is unused and later becomes a zero-size variadic data buffer.

For this case:

```text
[validity, views, null variadic buffer slot]
```

I also don't see the spec explicitly clarifying this, but I tend to think this should be considered an invalid state.
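
To illustrate why all-inline values need no variadic buffer, here is a minimal pure-Python sketch of the view encoding. It assumes the 16-byte view layout (values of at most 12 bytes stored inline, longer values stored as length, 4-byte prefix, buffer index, and offset) and little-endian byte order. `encode_views` is a hypothetical helper for illustration only; it is not how Arrow's builders are implemented.

```python
import struct

INLINE_MAX = 12  # assumed: values up to 12 bytes fit inside the 16-byte view
VIEW_SIZE = 16


def encode_views(values):
    """Simplified model: return (views buffer, list of variadic data buffers).

    Hypothetical helper, not Arrow's actual builder logic.
    """
    views = bytearray()
    data = bytearray()  # one variadic data buffer, only emitted if used
    for v in values:
        n = len(v)
        if n <= INLINE_MAX:
            # int32 length followed by the bytes themselves, zero-padded to 12
            views += struct.pack("<i12s", n, v)
        else:
            # int32 length, 4-byte prefix, buffer index, offset into that buffer
            views += struct.pack("<i4sii", n, v[:4], 0, len(data))
            data += v
    variadic = [bytes(data)] if data else []
    return bytes(views), variadic


views, variadic = encode_views([b"ab", b"cd", b"ef"])
print(len(views), len(variadic))  # 48 0
```

With three short values the views buffer is 3 x 16 = 48 bytes, matching the `48` reported by PyArrow above, and no data buffer is produced. In this model a data buffer appears only once some value exceeds 12 bytes.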
