pan3793 commented on PR #51182:
URL: https://github.com/apache/spark/pull/51182#issuecomment-2982827433

   > does Hadoop's LineRecordReader allow us to specify the compression at the session level without forking the code?
   
   @cloud-fan as @sandip-db said, Hadoop's `LineRecordReader` relies on the filename suffix to choose the decompressor. I think this is well-known behavior for processing text files: unlike binary formats (e.g. Parquet/ORC), which carry metadata in the footer marking the codec used by each page/chunk inside the file, compression for a text file is applied to the whole file, so there is no in-file metadata to inspect. A minimal sketch of this suffix-based lookup is shown below.
   
   How do you define the behavior of "specify the compression at the session level"? Always respect the session conf and ignore the filename suffix? Or fall back to the codec suggested by the session conf only when something goes wrong (e.g. the suffix is missing or unrecognized)?
   
   Also, please be careful that a Hadoop codec may behave differently from the Spark/Unix tool codec of the same name, for example HADOOP-12990 (LZ4).

