[GitHub] [spark] panbingkun commented on a diff in pull request #37591: [SPARK-40158][SQL] Remove useless configuration & extract common code for parquet read

GitBox Sun, 21 Aug 2022 19:47:21 -0700


panbingkun commented on code in PR #37591:
URL: https://github.com/apache/spark/pull/37591#discussion_r950960157



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -199,33 +199,7 @@ class ParquetFileFormat
       filters: Seq[Filter],
       options: Map[String, String],
       hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow] = {
-    hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, classOf[ParquetReadSupport].getName)
-    hadoopConf.set(
-      ParquetReadSupport.SPARK_ROW_REQUESTED_SCHEMA,
-      requiredSchema.json)
-    hadoopConf.set(
-      ParquetWriteSupport.SPARK_ROW_SCHEMA,

Review Comment:
   > It looks like this configuration is read at
   > 
   > https://github.com/apache/spark/blob/cf1a80eeae8bf815270fb39568b1846c2bd8d437/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L97-L99
   > 
   > Given that usage, it doesn't seem immediately self-evident that this configuration is no longer needed. Can you please explain in more detail why you think that it is safe to remove?
   
   - The data **WRITING** process for the parquet format is as follows:
   > 1. ParquetFileFormat.prepareWrite or ParquetWrite.prepareWrite
   > 2. ParquetWriteSupport.setSchema(dataSchema, conf) --- the attribute with key ParquetWriteSupport.SPARK_ROW_SCHEMA is set temporarily.
   > 3. ParquetWriteSupport --- reads the attribute with key ParquetWriteSupport.SPARK_ROW_SCHEMA.

   On the write path, ParquetWriteSupport.SPARK_ROW_SCHEMA only serves as an intermediate handoff: the schema is set in the configuration and then read back by ParquetWriteSupport.
   
   - The data **READING** process for the parquet format is as follows:
   > 1. ParquetFileFormat.buildReaderWithPartitionValues or ParquetScan.createReaderFactory
   > 2. hadoopConf.set(ParquetWriteSupport.SPARK_ROW_SCHEMA, readDataSchemaAsJson) --- the attribute with key ParquetWriteSupport.SPARK_ROW_SCHEMA is set temporarily.
   > 3. None of the code that follows reads this configuration (ParquetWriteSupport.SPARK_ROW_SCHEMA).
   
   - This PR deletes the useless configuration setting above only in the parquet read logic.
   - Even in the parquet source code, I could not find (via grep) any place where this key is used.
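
To make the difference between the two paths concrete, here is a minimal sketch. It does not use Spark or Hadoop: `FakeConf` is a hypothetical stand-in for a Hadoop `Configuration` that also records which keys were read, and the key strings are placeholders, not the real constant values. The write path sets the row-schema key and reads it back, while the read path sets it and never reads it.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a Hadoop Configuration: a string key/value
// store that also remembers which keys were ever read.
final class FakeConf {
  private val values = mutable.Map.empty[String, String]
  private val reads = mutable.Set.empty[String]

  def set(key: String, value: String): Unit = values.update(key, value)

  def get(key: String): Option[String] = {
    reads += key
    values.get(key)
  }

  def wasRead(key: String): Boolean = reads.contains(key)
}

object SchemaHandoffSketch {
  // Placeholder key names (assumptions for illustration only).
  val RowSchemaKey = "sketch.row.schema"                 // plays the role of SPARK_ROW_SCHEMA
  val RequestedSchemaKey = "sketch.row.requested_schema" // plays the role of SPARK_ROW_REQUESTED_SCHEMA

  // Write path: the schema is set (step 2) and read back (step 3).
  def writePath(conf: FakeConf, dataSchemaJson: String): String = {
    conf.set(RowSchemaKey, dataSchemaJson)
    conf.get(RowSchemaKey).getOrElse(sys.error("row schema missing"))
  }

  // Read path: the row-schema key is set, but only the requested-schema
  // key is consulted afterwards, so the first set is dead.
  def readPath(conf: FakeConf, requestedJson: String, dataJson: String): String = {
    conf.set(RequestedSchemaKey, requestedJson)
    conf.set(RowSchemaKey, dataJson) // the kind of call this PR removes
    conf.get(RequestedSchemaKey).getOrElse(sys.error("requested schema missing"))
  }

  def main(args: Array[String]): Unit = {
    val writeConf = new FakeConf
    writePath(writeConf, "{\"type\":\"struct\"}")
    println(s"write path read the key: ${writeConf.wasRead(RowSchemaKey)}") // true

    val readConf = new FakeConf
    readPath(readConf, "{\"requested\":true}", "{\"type\":\"struct\"}")
    println(s"read path read the key: ${readConf.wasRead(RowSchemaKey)}") // false
  }
}
```

In this sketch, dropping the `conf.set(RowSchemaKey, dataJson)` line from `readPath` changes nothing observable, which is the argument made above for the real read path.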
   



