pan3793 commented on PR #42336: URL: https://github.com/apache/spark/pull/42336#issuecomment-1665860195
Let me share my use case to help reviewers evaluate the benefit of this change.

Internally, most of our Spark jobs write Parquet/ORC files through the Hive SerDe (for compatibility in deployments that mix multiple computing engines). Recently, we have been promoting the zstd compression algorithm across our data warehouse, and one blocker is that we cannot identify the compression codec of a Parquet/ORC file without reading the file. On large HDFS clusters that may store over a billion files, opening each file and reading its metadata just to check the compression codec is quite costly. With this change, we can dump all file names from the NameNode and determine the codec simply from the file name.
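The audit described above could be sketched as follows. This is a minimal, hypothetical example: it assumes that, with this change, files written via the Hive SerDe carry the codec's default file extension (e.g. `.zstd`, `.snappy`) before the format suffix; the `CODEC_EXTENSIONS` mapping and both helper functions are illustrative, not part of Spark's API.

```python
import os

# Illustrative mapping of codec file-name extensions to codec names;
# the actual extensions depend on the codecs' default extensions.
CODEC_EXTENSIONS = {
    ".zstd": "zstd",
    ".snappy": "snappy",
    ".gz": "gzip",
    ".lz4": "lz4",
}

def infer_codec(file_name: str) -> str:
    """Return the compression codec implied by the file name,
    or "unknown" if no recognized codec extension is present."""
    base, _fmt_ext = os.path.splitext(file_name)  # strip ".parquet" / ".orc"
    _, codec_ext = os.path.splitext(base)         # the codec extension, if any
    return CODEC_EXTENSIONS.get(codec_ext, "unknown")

def codec_histogram(file_names):
    """Count files per codec from a list of file names (for example,
    a listing dumped from the NameNode), without opening any file."""
    counts = {}
    for name in file_names:
        codec = infer_codec(name)
        counts[codec] = counts.get(codec, 0) + 1
    return counts
```

With a name-based scheme like this, the whole-cluster audit reduces to a single pass over a NameNode file listing instead of a billion file opens.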
