sandip-db commented on PR #51182: URL: https://github.com/apache/spark/pull/51182#issuecomment-2978639268
> Even if you don't want to touch the Hadoop code, this PR approach looks too overkill, Hadoop provides io.compression.codecs.CompressionCodec to allow implementing custom codecs, implementing a org.apache.spark.xxx.SparkZstdCompressionCodec and configuring io.compression.codecs should work.

@pan3793 Thanks for your comment. The PR adds fewer than 50 lines of code to re-use Spark's `ZStdCompressionCodec` for the file data source.

The inlining of Hadoop's `LineRecordReader` would be needed regardless of ZSTD support because of a follow-up change I am working on, which would allow users to specify the compression type through a file data source option. Some users have files with non-standard extensions or no extension at all, and Hadoop's approach of determining the codec from the file name extension forces them to rename their files.

Implementing `CompressionCodec` and `Decompressor` seems unnecessary at this point. Once Spark upgrades to a Hadoop version with native ZSTD support, the code will automatically start using that instead of Spark's `ZStdCompressionCodec`.
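For context, the alternative suggested in the quoted comment would amount to registering a custom codec with Hadoop's configuration, roughly like the sketch below. The `org.apache.spark.xxx.SparkZstdCompressionCodec` class name is the hypothetical placeholder from the suggestion above, not an existing class; codec resolution would still be driven by file name extension, which is exactly the limitation this PR is trying to avoid.

```xml
<!-- Sketch of the io.compression.codecs approach (hypothetical codec class).
     Hadoop's CompressionCodecFactory picks a codec from this list by matching
     the file's extension, so files without a recognized extension still fail. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.spark.xxx.SparkZstdCompressionCodec</value>
</property>
```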
