telemenar commented on issue #1194:
URL: https://github.com/apache/arrow-java/issues/1194#issuecomment-4858336810
There is some nuance here because there are two things in tension:

* Arrow layout/spec compliance: a list array with `valueCount == 0` still has one offset entry, so the exported offset buffer should contain `offset[0] == 0`.
* Current `ListVector` / `LargeListVector` lifecycle behavior: freshly constructed, cleared, or otherwise empty vectors may legitimately have no allocated buffers yet.

Either way, the current `setReaderAndWriterIndex()` behavior for a `valueCount == 0` `ListVector` is guaranteed to produce an invalid buffer state if the vector is still in the same state returned by `ListVector.empty()`.

There are also a few ways to get into that state somewhat unexpectedly. For example, `ListVector.TransferImpl.splitAndTransfer()` calls `ListVector.clear()` on the destination vector. When the split length is `0`, that can release the destination buffers and leave the destination with `allocator.getEmpty()` as its offset buffer.

Given the existing clear/reset behavior and the surrounding class hierarchy, I suspect persistent early enforcement would be tricky to implement cleanly. It would mean preserving or recreating the one-entry offset buffer across construction, `clear()`, and other empty-vector transitions, which may conflict with existing assumptions that an empty vector can hold no buffers.

So boundary enforcement may be the more practical direction: allow the internal empty/unallocated state, but materialize the required zero offset at API boundaries where Arrow physical layout matters.

The design questions I’m not fully sure about are:

* Other than the paths that call `setReaderAndWriterIndex()`, are there other boundaries that need the same protection?
* Is it acceptable for `getFieldBuffers()` to allocate the required one-entry offset buffer as part of preparing the vector for export/serialization?
* Should call sites like zero-length `splitAndTransfer()` also enforce this before returning a schema-visible destination vector, or should that be left entirely to export-time handling?
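
To make the boundary-enforcement option concrete, here is a minimal sketch of the idea. It uses plain `java.nio.ByteBuffer` as a stand-in for `ArrowBuf`, and the names `OffsetBoundarySketch`, `isValidForExport`, and `offsetsForExport` are hypothetical, not part of Arrow Java. It models only the offset buffer of a 32-bit list layout and assumes a zero-capacity buffer is a fair approximation of what `allocator.getEmpty()` returns.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in, NOT Arrow Java API: models only the offset
// buffer of a 32-bit list layout, using ByteBuffer in place of ArrowBuf.
final class OffsetBoundarySketch {
    static final int OFFSET_WIDTH = 4; // bytes per 32-bit list offset

    /** A list with valueCount entries needs valueCount + 1 offsets, and offset[0] must be 0. */
    static boolean isValidForExport(ByteBuffer offsets, int valueCount) {
        long requiredBytes = ((long) valueCount + 1) * OFFSET_WIDTH;
        // The size check short-circuits, so getInt(0) is never called on a too-small buffer.
        return offsets.limit() >= requiredBytes && offsets.getInt(0) == 0;
    }

    /**
     * Boundary enforcement: the internal (possibly unallocated) buffer is left
     * untouched, and a one-entry zero offset is materialized only for export.
     */
    static ByteBuffer offsetsForExport(ByteBuffer offsets, int valueCount) {
        if (valueCount == 0 && offsets.limit() < OFFSET_WIDTH) {
            // ByteBuffer.allocate zero-fills, so offset[0] == 0 holds immediately.
            return ByteBuffer.allocate(OFFSET_WIDTH).order(ByteOrder.LITTLE_ENDIAN);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Assumption: models the empty/cleared state with a zero-capacity buffer.
        ByteBuffer unallocated = ByteBuffer.allocate(0);
        System.out.println(isValidForExport(unallocated, 0)); // false

        ByteBuffer exported = offsetsForExport(unallocated, 0);
        System.out.println(isValidForExport(exported, 0));    // true
        System.out.println(unallocated.capacity());           // 0, internal state unchanged
    }
}
```

The point of the sketch is that the internal empty state stays legal while anything crossing an export boundary sees a spec-compliant buffer. Whether that allocation belongs in `getFieldBuffers()` or at individual call sites is exactly the open question above.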
