wgtmac commented on PR #714:
URL: https://github.com/apache/iceberg-cpp/pull/714#issuecomment-4703598652

   Thanks for improving this, @rahulsmahadev! I think the long-term shape is to 
make this output-schema driven, ideally allowing callers to choose `list` vs 
`large_list` per projected column. Arrow’s Parquet reader only exposes a global 
`set_list_type` today, so keeping a global flag for this PR seems reasonable.
   
   One API concern: since this option lives in generic `ReaderProperties`, it 
reads as format-independent. Either other readers such as Avro should honor it 
too, or the option should be clearly scoped/documented as Parquet-only.
   
   One thing that is worth noting: when `read.arrow.use-large-list` is enabled, 
the reader always rewrites the output schema to `large_list`, but Arrow ignores 
`set_list_type(LARGE_LIST)` when the Parquet file contains serialized 
`ARROW:schema` metadata. In that case the actual batch can still contain 
`ListArray`, while `ProjectRecordBatch` dispatches as `LargeListArray`, causing 
a bad cast/crash. Please base the output schema on the actual reader schema or 
add a guard, and cover this with an `ARROW:schema` regression test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to