Re: [PR] Document Arrow <--> Parquet schema conversion better [arrow-rs]

via GitHub Wed, 07 May 2025 10:36:23 -0700


alamb commented on PR #7479:
URL: https://github.com/apache/arrow-rs/pull/7479#issuecomment-2859515712


   I wrote this PR because I don't think it is clear how Parquet / Arrow 
schema conversions are handled, including the embedded Arrow schema hint and 
the APIs that let people supply or modify their own hint.
   
   
   > I think the major confusion, which this PR didn't create, but which it 
also doesn't really address is that the arrow schema provided may not be what 
the reader actually uses. If say the arrow schema says TimestampNanoseconds, 
but the parquet is actually TimestampMilliseconds, IIRC it will return 
TimestampMilliseconds.
   
   My experience is that if the hint schema is provided but doesn't match what 
is read from the file, an error is raised: 
   
   
https://github.com/apache/arrow-rs/blob/812160005efe3afc63531b8ea051e1fa44a91f67/parquet/src/arrow/arrow_reader/mod.rs#L541-L540
   
   > called `Result::unwrap()` on an `Err` value: ArrowError("incompatible 
arrow schema, the following fields could not be cast: [column1]")
   
   The error is actually pretty bad. I'll make a new PR to improve that.
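
   To make the behavior concrete, here is a toy, std-only model of the check 
as I understand it. The `DataType`, `Field`, and `check_hint` names are 
hypothetical and are not the parquet or arrow crate API, and the "types must 
be identical" rule is an assumption made only for this sketch; the real reader 
may accept more combinations.

```rust
// Toy model only: hypothetical types, NOT the parquet / arrow crate API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    TimestampMillisecond,
    TimestampNanosecond,
}

struct Field {
    name: &'static str,
    data_type: DataType,
}

/// Compare the schema derived from the file with a user-supplied hint.
/// Assumption for this sketch: a hint field is compatible only if its type
/// is identical to the file's type. Every mismatched field is reported.
fn check_hint(file: &[Field], hint: &[Field]) -> Result<(), String> {
    if file.len() != hint.len() {
        return Err(format!(
            "incompatible arrow schema, expected {} fields got {}",
            file.len(),
            hint.len()
        ));
    }
    // Collect the names of all hint fields whose type differs from the file.
    let bad: Vec<&str> = file
        .iter()
        .zip(hint)
        .filter(|(f, h)| f.data_type != h.data_type)
        .map(|(_, h)| h.name)
        .collect();
    if bad.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "incompatible arrow schema, the following fields could not be cast: [{}]",
            bad.join(", ")
        ))
    }
}

fn main() {
    // The file stores milliseconds, but the hint asks for nanoseconds.
    let file = [
        Field { name: "column1", data_type: DataType::TimestampMillisecond },
        Field { name: "column2", data_type: DataType::Int64 },
    ];
    let hint = [
        Field { name: "column1", data_type: DataType::TimestampNanosecond },
        Field { name: "column2", data_type: DataType::Int64 },
    ];
    match check_hint(&file, &hint) {
        Ok(()) => println!("hint accepted"),
        Err(e) => println!("error: {e}"),
    }
}
```

   In this model the mismatched hint is rejected outright rather than being 
silently replaced by the file's type, which matches what I saw above.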
   


