pan3793 commented on PR #42336:
URL: https://github.com/apache/spark/pull/42336#issuecomment-1665860195

   Let me share my use case to help reviewers evaluate the benefit of this 
change.
   
   Internally, most of our Spark jobs write Parquet/ORC files through the Hive 
SerDe (mainly for compatibility with deployments that mix multiple computing 
engines). Recently, we have been promoting the zstd compression algorithm 
across our data warehouse, and one blocker is that we cannot identify the 
compression algorithm of a Parquet/ORC file without opening and reading it.
   
   On large HDFS clusters that may store over a billion files, opening each 
file and reading its metadata just to check the compression algorithm is 
prohibitively costly. With this change, we can dump all file names from the 
NameNode and determine the codec simply by inspecting the file name.
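
   To illustrate the workflow described above, here is a minimal sketch of 
inspecting a NameNode file listing to classify codecs by name. The 
suffix-to-codec table and the example file names are illustrative assumptions, 
not Spark's actual naming table:

   ```python
   # Hypothetical sketch: infer the compression codec of a Parquet/ORC file
   # from its name alone, assuming the writer embeds the codec in the file
   # name (e.g. "part-00000.zstd.parquet"). The mapping below is illustrative.
   CODEC_SUFFIXES = {
       ".snappy": "snappy",
       ".zstd": "zstd",
       ".gz": "gzip",
       ".lz4": "lz4",
   }

   def infer_codec(file_name: str) -> str:
       """Return the codec encoded in the file name, or a fallback label."""
       for fmt in (".parquet", ".orc"):
           if file_name.endswith(fmt):
               stem = file_name[: -len(fmt)]
               for suffix, codec in CODEC_SUFFIXES.items():
                   if stem.endswith(suffix):
                       return codec
               return "uncompressed-or-unlabeled"
       return "unknown"

   print(infer_codec("part-00000-abc.zstd.parquet"))  # zstd
   print(infer_codec("part-00001-def.orc"))           # uncompressed-or-unlabeled
   ```

   With names labeled this way, a codec census of a billion-file cluster 
becomes a scan over the NameNode listing rather than a billion file opens.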


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

