cloud-fan commented on pull request #31319: URL: https://github.com/apache/spark/pull/31319#issuecomment-767528646
I've taken a close look at the parquet read path. The data flow is:

1. The parquet MR reader (`ParquetRowConverter`) reads data from parquet files, creates `Decimal` objects, and stores them in a `SpecificInternalRow`.
2. `ParquetFileFormat` converts the `SpecificInternalRow` to an `UnsafeRow` before returning the rows to downstream operators. The conversion is done by `UnsafeRowWriter`. When it writes a decimal, it calls `Decimal.changePrecision`, which is essentially a cast that makes sure the decimal value matches the precision/scale of the corresponding decimal type.

For the vectorized reader, we read data from parquet files and write it to a `ColumnVector` in binary format directly, which makes it hard to apply such a "cast" and fix a decimal precision/scale mismatch. If we did, we would need to create `Decimal` objects in the middle, which may slow down the vectorized reader a lot.
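
Not part of the PR, just to illustrate step 2: a minimal sketch of what the `Decimal.changePrecision` "cast" does, assuming a Spark runtime on the classpath. The sample value and the `DECIMAL(10, 2)` target are made up for the example.

```scala
// Minimal sketch (not the actual UnsafeRowWriter code) showing how
// Decimal.changePrecision normalizes a value to the declared precision/scale.
import org.apache.spark.sql.types.Decimal

// Suppose the file stored 123.4 with scale 1, but the table schema declares DECIMAL(10, 2).
val d = Decimal(BigDecimal("123.4"))

// changePrecision rescales the value in place and returns false on overflow.
if (d.changePrecision(10, 2)) {
  println(d)            // prints 123.40, now matching DECIMAL(10, 2)
} else {
  println("overflow")   // the value cannot fit the requested precision/scale
}
```

This per-value object creation is exactly the cost the comment above refers to: the vectorized path writes raw binary into `ColumnVector`, so inserting such a normalization step would mean materializing `Decimal` objects in the middle.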
