AngersZhuuuu edited a comment on pull request #34308:
URL: https://github.com/apache/spark/pull/34308#issuecomment-965262139
After digging into this case: the exception is caused by reading Parquet files with different schemas. If we don't set `mergeSchema`, Spark directly uses the first file's schema to read all the data. So when one column is long type in the first file and int type in the second, reading the second file tries to decode the data as long while that file's column descriptor is int type, which surfaces as the unsupported-decoding error reported here. However, such a mismatch should already be rejected by @sunchao's PR https://github.com/apache/spark/pull/32777: the case is caught in `ParquetVectorUpdaterFactory.getUpdater()`, which throws an exception that includes the offending file path (a repro sketch is at the end of this comment):

```
[info] Cause: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///Users/yi.zhu/Documents/project/Angerszhuuuu/spark/target/tmp/spark-3eccc50d-9d9c-4970-9674-87de46ea1192/test-002.parquet/part-00000-4332031b-e514-4b95-b52a-e8d798c999e6-c000.parquet. Column: [a], Expected: bigint, Found: INT32
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:586)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:172)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
```

Thanks all for your help, @sunchao @cloud-fan @sadikovi. Hope you can confirm, and then I will close this one.
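For anyone who wants to reproduce this locally, here is a minimal, self-contained sketch of the scenario above (this is only my illustration, not code from the PR: the object name and scratch path are made up, and which file's footer wins schema inference can vary):

```scala
import org.apache.spark.sql.SparkSession

object ParquetSchemaMismatchRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parquet-schema-mismatch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical scratch directory; any writable path works.
    val path = "/tmp/parquet-schema-mismatch"

    // First file: column `a` written as bigint (Parquet INT64).
    Seq(1L, 2L).toDF("a").write.mode("overwrite").parquet(path)

    // Second file in the same directory: column `a` written as int (Parquet INT32).
    Seq(3, 4).toDF("a").write.mode("append").parquet(path)

    // With spark.sql.parquet.mergeSchema left at its default (false), schema
    // inference picks a single footer's schema for the whole directory, so the
    // vectorized reader can end up decoding the INT32 file against a bigint
    // column descriptor and fail with
    // "Parquet column cannot be converted ... Expected: bigint, Found: INT32".
    spark.read.parquet(path).show()

    spark.stop()
  }
}
```

With the check from https://github.com/apache/spark/pull/32777 in place, the mismatch fails fast in `ParquetVectorUpdaterFactory.getUpdater()` and is reported as the `QueryExecutionException` with the offending file path, as in the stack trace above.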
