sandip-db commented on PR #51182:
URL: https://github.com/apache/spark/pull/51182#issuecomment-2986808266
@pan3793 Thanks for your input.
> how do you define the behavior of "specify the compression at the session
> level"? always respect session conf and ignore filename suffix? or fallback to
> use codec suggested by session conf when something goes wrong?
Users should be able to specify the compression type in their query using
a data source reader option. For example:
`spark.read.option("compression", "gzip").json(path)`
If this option is specified, Spark will always use the compression type
specified by the user. When it is not specified, there are alternatives for
determining the codec type, such as first checking the file path extension or
reading the magic number (as the `file` utility does).
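
As a rough illustration of that precedence (explicit option first, file extension as a fallback), here is a minimal Python sketch. It is not Spark's implementation: `resolve_codec` and the `EXTENSION_CODECS` table are hypothetical names made up for this example.

```python
import os
from typing import Optional

# Hypothetical suffix-to-codec table, for illustration only.
# It is not the mapping Spark or Hadoop actually uses.
EXTENSION_CODECS = {".gz": "gzip", ".bz2": "bzip2", ".zst": "zstd"}


def resolve_codec(path: str, compression_option: Optional[str] = None) -> Optional[str]:
    """Return the codec name to use for `path`, or None if no codec applies."""
    if compression_option is not None:
        # An explicit reader option always wins, whatever the file name says.
        return compression_option.lower()
    # Otherwise fall back to the file path extension.
    _, ext = os.path.splitext(path)
    return EXTENSION_CODECS.get(ext.lower())
```

With this sketch, `resolve_codec("events.json.gz", "zstd")` picks the user's choice over the `.gz` suffix, while `resolve_codec("events.json.zst")` falls back to the suffix.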
> Instead of relying on the filename suffix or Spark session conf to choose
> the codec, I wonder if the Magic Number was considered? For example, the file
> command can correctly recognize the file codec even without standard filename
> extension
I agree, and I wonder why Hadoop didn't do this in the first place. There
will be some performance penalty for examining the magic number and reopening
the file input stream with the appropriate codec.
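
For reference, magic-number detection can be sketched in a few lines of Python. This is only an illustration of the idea, not code from this PR: `sniff_codec` and `MAGIC_NUMBERS` are hypothetical names, and the sketch assumes a seekable stream so that it can rewind instead of reopening the file.

```python
import io
from typing import BinaryIO, Optional

# Leading bytes of a few common compressed formats.
MAGIC_NUMBERS = [
    (b"\x28\xb5\x2f\xfd", "zstd"),   # Zstandard frame magic number
    (b"\x1f\x8b", "gzip"),           # gzip member header
    (b"BZh", "bzip2"),               # bzip2 stream header
]


def sniff_codec(stream: BinaryIO) -> Optional[str]:
    """Peek at the first bytes of `stream` and guess the codec, or return None."""
    header = stream.read(4)
    # Rewind so the caller can hand the stream to the chosen decompressor.
    # A non-seekable stream would have to be reopened here, which is the
    # extra cost mentioned above.
    stream.seek(0)
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return None


# Usage: the first two bytes mark this as gzip regardless of any file name.
sample = io.BytesIO(b"\x1f\x8b\x08\x00")
print(sniff_codec(sample))
```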
Having said that, we can move the discussion of the compression option and
the use of the magic number to my next PR.
The current PR is about adding ZSTD decompression support. I would
appreciate it if we could close this one first.
The try-catch logic picks the codec in the following order:
- User-provided ZSTD codec via the `io.compression.codecs.CompressionCodec`
Hadoop conf
- Hadoop's default ZSTD codec
- Spark's `ZStdCompressionCodec`
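
The ordering above follows a generic "try each candidate in turn" pattern, sketched below in Python. This is not the code in the PR: `pick_codec` is a hypothetical helper, and each factory stands in for one of the three codec sources, raising an exception when that source is unavailable.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A factory builds a codec or raises if that codec cannot be used.
CodecFactory = Callable[[], Any]


def pick_codec(factories: Sequence[Tuple[str, CodecFactory]]) -> Tuple[str, Any]:
    """Return (name, codec) for the first factory that succeeds, in list order."""
    errors: List[str] = []
    for name, factory in factories:
        try:
            return name, factory()
        except Exception as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no ZSTD codec available: " + "; ".join(errors))
```

Passing the candidates as `[("user", ...), ("hadoop", ...), ("spark", ...)]` would give the priority described above, with later entries used only when the earlier ones fail.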
For the compatibility check, I have added more tests that read files
compressed with ZSTD on Ubuntu (version: 1.4.4+dfsg-3ubuntu0.1).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]