Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22197#discussion_r213597570
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -366,18 +367,29 @@ class ParquetFileFormat
       val sharedConf = broadcastedHadoopConf.value.value
-      lazy val footerFileMetaData =
+      val footerFileMetaData =
         ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData
+
+      val parquetRequestedSchema = {
+        val schemaString =
+          sharedConf.get(ParquetReadSupport.SPARK_ROW_CATALYST_REQUESTED_SCHEMA)
+        assert(schemaString != null, "Catalyst requested schema not set.")
+        val catalystRequestedSchema = StructType.fromString(schemaString)
+        val parquetSchema = footerFileMetaData.getSchema
+        ParquetReadSupport.clipParquetSchema(
+          parquetSchema, catalystRequestedSchema, isCaseSensitive)
+      }
+      sharedConf.set(ParquetReadSupport.SPARK_ROW_PARQUET_REQUESTED_SCHEMA,
--- End diff ---
We are already on the executor side here, so why do we need to set the conf at all? We could even pass the `parquetRequestedSchema` to the reader via its constructor, as sketched below.
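A minimal sketch of what that could look like, assuming a reader type whose constructor accepts the clipped schema directly; `ExampleParquetReader` and its signature are hypothetical illustrations for this suggestion, not the actual Spark reader API:

```scala
import org.apache.parquet.schema.MessageType
import org.apache.spark.sql.types.StructType

// Hypothetical reader: the clipped Parquet schema computed above is handed
// over as a constructor argument, so no executor-side conf round-trip is needed.
class ExampleParquetReader(
    parquetRequestedSchema: MessageType,
    catalystRequestedSchema: StructType) {
  // ... build column readers from parquetRequestedSchema ...
}
```

Passing the schema explicitly would also avoid mutating `sharedConf`, which comes from a broadcast and may be shared across tasks running on the same executor.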