liukun4515 commented on PR #3431: URL: https://github.com/apache/arrow-rs/pull/3431#issuecomment-1370615611
> > I think the data in the arrow ecosystem is exchanged by IPC format
>
> Sometimes, but an important property is that data written by one implementation to CSV, Parquet, or whatever can be read by another

Why is this related to other file formats? The change only enhances writing for the Parquet file format, and it will not affect CSV or other file formats.

> To phrase my concern differently, decimals are a relatively esoteric type, with most arrow implementations having limited support. I worry with this PR we will now write decimal data in such a way arrow implementations that used to understand it, now won't.
>
> Can you confirm pyarrow at least can correctly read the data written by this PR?

From https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc#L1227, the C++ implementation supports reading decimal data from INT32/INT64, but it does not support writing decimals using the INT32/INT64 Parquet physical types (https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L811). This is consistent with the comments on how Arrow writes Parquet:

```
DECIMAL | INT32 / INT64 / BYTE_ARRAY / FIXED_LENGTH_BYTE_ARRAY | Decimal128 / Decimal256 | (2)

(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.
```

The Go write path is the same as the C++ one.
go: https://github.com/apache/arrow/blob/master/go/parquet/pqarrow/schema.go#L303

I can't find the write path for pyarrow, though.

@tustvold
However, all of the language implementations support reading decimals stored as INT32/INT64/FIXED_LEN_BYTE_ARRAY/BYTE_ARRAY in a Parquet file.
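To make the read/write asymmetry concrete, here is a minimal Python sketch of the encoding choice under discussion. It is not the arrow-rs, parquet-cpp, or Go code. The function names are hypothetical, and the precision thresholds (up to 9 digits for INT32, up to 18 for INT64) are taken from my reading of the Parquet format's DECIMAL logical type. The point is that a reader recovers the same value whichever physical type the writer picked, as long as it handles each one.

```python
from decimal import Decimal


def physical_type_for_decimal(precision: int) -> str:
    """Hypothetical helper: narrowest Parquet physical type for a precision.

    Assumes the DECIMAL rules of the Parquet format: INT32 holds up to 9
    digits, INT64 up to 18, and anything wider needs a byte array.
    """
    if precision < 1:
        raise ValueError("precision must be >= 1")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def min_flba_length(precision: int) -> int:
    """Smallest byte width whose signed range covers every value of `precision` digits."""
    n = 1
    # A signed n-byte integer reaches 2**(8n - 1) - 1; it must cover 10**precision - 1.
    while 2 ** (8 * n - 1) < 10 ** precision:
        n += 1
    return n


def decode_unscaled(unscaled: int, scale: int) -> Decimal:
    """Turn an unscaled integer (as stored in INT32/INT64) into a Decimal, exactly."""
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a tuple avoids rounding to the decimal context precision.
    return Decimal((sign, digits, -scale))


def decode_flba(raw: bytes, scale: int) -> Decimal:
    """Decode a big-endian two's-complement byte array into a Decimal."""
    return decode_unscaled(int.from_bytes(raw, "big", signed=True), scale)


if __name__ == "__main__":
    value = -12345  # unscaled representation of -123.45 with scale 2
    as_bytes = value.to_bytes(min_flba_length(9), "big", signed=True)
    print(physical_type_for_decimal(9), decode_unscaled(value, 2))
    print("FIXED_LEN_BYTE_ARRAY", decode_flba(as_bytes, 2))
```

Both decode paths yield the same `Decimal`, which is why the compatibility question reduces to whether every reader handles all of these physical types, not to what any one writer emits.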
