sandip-db commented on PR #51182: URL: https://github.com/apache/spark/pull/51182#issuecomment-2978639268
> Even if you don't want to touch the Hadoop code, this PR approach looks too overkill, Hadoop provides io.compression.codecs.CompressionCodec to allow implementing custom codecs, implementing a org.apache.spark.xxx.SparkZstdCompressionCodec and configuring io.compression.codecs should work.

@pan3793 Thanks for your comment. The PR adds fewer than 50 lines of code to re-use Spark's `ZStdCompressionCodec` for the file data source.

The inlining of Hadoop's `LineRecordReader` would be needed regardless of ZSTD support because of a follow-up change I am working on, which would allow users to specify the compression type through a file data source option. Some users have files with non-standard extensions or no extension at all, and Hadoop's approach of determining the codec from the file name extension forces them to rename their files.

Implementing `CompressionCodec` and `Decompressor` seems unnecessary at this point. Once Spark upgrades to a Hadoop version with native ZSTD support, the code will automatically start using that instead of Spark's `ZStdCompressionCodec`.
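For context, the alternative suggested in the quoted comment would amount to registering a custom codec with Hadoop's configuration, roughly like the sketch below. The `org.apache.spark.xxx.SparkZstdCompressionCodec` class name is the hypothetical placeholder from the suggestion above, not an existing class; codec resolution would still be driven by file name extension, which is exactly the limitation this PR is trying to avoid.

```xml
<!-- Sketch of the io.compression.codecs approach (hypothetical codec class).
     Hadoop's CompressionCodecFactory picks a codec from this list by matching
     the file's extension, so files without a recognized extension still fail. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.spark.xxx.SparkZstdCompressionCodec</value>
</property>
```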
