pan3793 commented on PR #51182: URL: https://github.com/apache/spark/pull/51182#issuecomment-2982827433
> does Hadoop's LineRecordReader allow us to specify the compression at the session level without forking the code?

@cloud-fan as @sandip-db said, Hadoop's `LineRecordReader` relies on the filename suffix to choose the decompressor. I think this is well-known behavior for processing text files: unlike binary formats (e.g. Parquet/ORC), which carry metadata in the footer recording the codec used by each page/chunk, compression for a text file is applied to the whole file, so there is no in-file metadata to consult.

How do you define the behavior of "specify the compression at the session level"? Always respect the session conf and ignore the filename suffix? Or fall back to the codec suggested by the session conf only when suffix-based detection fails?

Also, please be careful that a Hadoop codec may behave differently from the corresponding Spark/Unix tool codec, for example HADOOP-12990 (lz4).
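For reference, a minimal Scala sketch (not from this PR, just illustrating the mechanism) of the extension-based lookup that `LineRecordReader` delegates to `CompressionCodecFactory`, plus what a hypothetical "always respect the session conf" path might look like. The object name `CodecLookupSketch`, the helper `openForced`, and the use of `GzipCodec` as the forced codec are all illustrative assumptions, not anything proposed in the PR:

```scala
import java.io.InputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.compress.{CompressionCodec, CompressionCodecFactory, GzipCodec}
import org.apache.hadoop.util.ReflectionUtils

object CodecLookupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // What LineRecordReader does internally: the codec is chosen purely by
    // filename suffix; a null result means the file is read as plain text.
    val factory = new CompressionCodecFactory(conf)
    Seq("data.txt", "data.gz", "data.bz2", "data.lz4").foreach { name =>
      val codec = Option(factory.getCodec(new Path(name)))
      println(s"$name -> ${codec.map(_.getClass.getSimpleName).getOrElse("<no codec, plain text>")}")
    }
  }

  // Hypothetical "always respect the session conf" variant: instantiate the
  // configured codec directly and ignore the suffix entirely. GzipCodec is
  // just a stand-in for whatever codec class a session conf would name.
  def openForced(conf: Configuration, fs: FileSystem, file: Path): InputStream = {
    val forced: CompressionCodec = ReflectionUtils.newInstance(classOf[GzipCodec], conf)
    forced.createInputStream(fs.open(file))
  }
}
```

Note that even in the forced-codec variant the decompression semantics still come from the Hadoop codec implementation, which is where differences like HADOOP-12990 would surface.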
