panbingkun commented on code in PR #37591:
URL: https://github.com/apache/spark/pull/37591#discussion_r950960157
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -199,33 +199,7 @@ class ParquetFileFormat
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow] =
{
-    hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, classOf[ParquetReadSupport].getName)
-    hadoopConf.set(
-      ParquetReadSupport.SPARK_ROW_REQUESTED_SCHEMA,
-      requiredSchema.json)
-    hadoopConf.set(
-      ParquetWriteSupport.SPARK_ROW_SCHEMA,
Review Comment:
> It looks like this configuration is read at
>
> https://github.com/apache/spark/blob/cf1a80eeae8bf815270fb39568b1846c2bd8d437/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L97-L99
>
> Given that usage, it doesn't seem immediately self-evident that this
> configuration is no longer needed. Can you please explain in more detail why
> you think that it is safe to remove?

The data **writing** process for the Parquet format is as follows:
1. ParquetFileFormat.prepareWrite or ParquetWrite.prepareWrite is called.
2. ParquetWriteSupport.setSchema(dataSchema, conf) sets the key
ParquetWriteSupport.SPARK_ROW_SCHEMA in the configuration.
3. ParquetWriteSupport reads the key ParquetWriteSupport.SPARK_ROW_SCHEMA
back from the configuration.

On the write path, ParquetWriteSupport.SPARK_ROW_SCHEMA is only an
intermediate hand-off: the schema is set in step 2 and consumed in step 3.

The data **reading** process for the Parquet format is as follows:
1. ParquetFileFormat.buildReaderWithPartitionValues or
ParquetScan.createReaderFactory is called.
2. hadoopConf.set(ParquetWriteSupport.SPARK_ROW_SCHEMA, readDataSchemaAsJson)
sets the key ParquetWriteSupport.SPARK_ROW_SCHEMA in the configuration.
3. None of the code that runs afterwards on the read path reads
ParquetWriteSupport.SPARK_ROW_SCHEMA.

The value set in step 2 of the read path is therefore never used, which is
why I think it is safe to remove.
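
To make the difference between the two paths concrete, here is a minimal, self-contained sketch. It does not use Spark or Hadoop: `TrackingConf` is a hypothetical stand-in for the Hadoop `Configuration` that counts reads per key, `RowSchemaKey` is a placeholder for `ParquetWriteSupport.SPARK_ROW_SCHEMA` (not the real key string), and `writePath` / `readPath` are illustrative names that only model the steps listed above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Hadoop's Configuration that counts reads per key.
final class TrackingConf {
  private val values = mutable.Map.empty[String, String]
  private val reads = mutable.Map.empty[String, Int].withDefaultValue(0)

  def set(key: String, value: String): Unit = values(key) = value

  def get(key: String): Option[String] = {
    reads(key) += 1 // record that someone consumed this key
    values.get(key)
  }

  def readCount(key: String): Int = reads(key)
}

object RowSchemaFlow {
  // Placeholder for ParquetWriteSupport.SPARK_ROW_SCHEMA (assumption, not the real key).
  val RowSchemaKey = "SPARK_ROW_SCHEMA"

  // Write path: the schema is stored (step 2) and then read back (step 3).
  def writePath(conf: TrackingConf, schemaJson: String): Option[String] = {
    conf.set(RowSchemaKey, schemaJson) // models ParquetWriteSupport.setSchema
    conf.get(RowSchemaKey)             // models ParquetWriteSupport reading the key
  }

  // Read path: the key is stored (step 2), but no later step reads it.
  def readPath(conf: TrackingConf, schemaJson: String): Unit = {
    conf.set(RowSchemaKey, schemaJson)
  }
}
```

In this model the key is read once on the write path and never on the read path, which is the property that would make the `hadoopConf.set` call on the read path removable. Whether the real read path has no other consumer still has to be checked against the Spark code itself.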
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]