cloud-fan commented on pull request #31319: URL: https://github.com/apache/spark/pull/31319#issuecomment-767528646
I've taken a close look at the parquet read path. The data flow is:

1. The parquet MR reader (`ParquetRowConverter`) reads data from parquet files, creates `Decimal` objects, and stores them in a `SpecificInternalRow`.
2. `ParquetFileFormat` converts the `SpecificInternalRow` to an `UnsafeRow` before returning the rows to downstream operators. The conversion is done by `UnsafeRowWriter`. When it writes a decimal, it calls `Decimal.changePrecision`, which is essentially a cast that makes sure the decimal value matches the precision/scale of the corresponding decimal type.

For the vectorized reader, we read data from parquet files and write it to a `ColumnVector` in binary format directly, which makes it hard to apply such a "cast" and fix a decimal precision/scale mismatch. If we did, we would need to create `Decimal` objects in the middle, which may slow down the vectorized reader a lot.
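
Not part of the PR, just to illustrate step 2: a minimal sketch of what the `Decimal.changePrecision` "cast" does, assuming a Spark runtime on the classpath. The sample value and the `DECIMAL(10, 2)` target are made up for the example.

```scala
// Minimal sketch (not the actual UnsafeRowWriter code) showing how
// Decimal.changePrecision normalizes a value to the declared precision/scale.
import org.apache.spark.sql.types.Decimal

// Suppose the file stored 123.4 with scale 1, but the table schema declares DECIMAL(10, 2).
val d = Decimal(BigDecimal("123.4"))

// changePrecision rescales the value in place and returns false on overflow.
if (d.changePrecision(10, 2)) {
  println(d)            // prints 123.40, now matching DECIMAL(10, 2)
} else {
  println("overflow")   // the value cannot fit the requested precision/scale
}
```

This per-value object creation is exactly the cost the comment above refers to: the vectorized path writes raw binary into `ColumnVector`, so inserting such a normalization step would mean materializing `Decimal` objects in the middle.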
