dongjoon-hyun commented on pull request #32051:
URL: https://github.com/apache/spark/pull/32051#issuecomment-813096687


   @MaxGekk, it's for consistency within Apache Spark (and with Apache Parquet).
   
   If you are asking about our history, Apache Spark has been using `zstd` for event logs since Apache Spark 2.3.0:
   ```
   $ ls -al *zstd      
   -rw-rw----  1 dongjoon  staff   33101 Apr  4 13:30 local-1617568193664.zstd
   -rw-r--r--@ 1 dongjoon  staff  856181 Nov  5 01:16 spark-6248534cdfc14ae698cddd2701ab61b3.zstd
   ```
   
   Apache Parquet has been using `.zstd` since Apache Parquet 1.10.0 (PARQUET-1143).
   - https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/CompressionCodecName.java#L33
   
   As a result, the Parquet ZSTD files generated by Apache Spark already use `.zstd` (previously only on ZSTD-enabled Hadoop), because we delegate the extension name to Apache Parquet. From Apache Spark 3.2.0, we can use these file names in all environments.
   ```
   $ ls -al /tmp/p_zstd                      
   total 24
   drwxr-xr-x   6 dongjoon  wheel  192 Apr  4 13:27 .
   drwxrwxrwt  16 root      wheel  512 Apr  4 13:27 ..
   -rw-r--r--   1 dongjoon  wheel    8 Apr  4 13:27 ._SUCCESS.crc
   -rw-r--r--   1 dongjoon  wheel   12 Apr  4 13:27 .part-00000-17277a00-8e7f-4797-b0b6-6a30ded6617e-c000.zstd.parquet.crc
   -rw-r--r--   1 dongjoon  wheel    0 Apr  4 13:27 _SUCCESS
   -rw-r--r--   1 dongjoon  wheel  511 Apr  4 13:27 part-00000-17277a00-8e7f-4797-b0b6-6a30ded6617e-c000.zstd.parquet
   ```
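
To make the delegation concrete, here is a minimal Python sketch of how such a file name could be composed. It is not Spark's or Parquet's actual code: the `CODEC_EXTENSIONS` table is only an illustrative subset modeled on Parquet's `CompressionCodecName`, and the helper names `parquet_file_extension` and `part_file_name` are hypothetical.

```python
# Illustrative subset of codec-to-extension pairs, modeled on Parquet's
# CompressionCodecName enum. This is an assumption for the sketch, not an
# exhaustive or authoritative copy of that enum.
CODEC_EXTENSIONS = {
    "UNCOMPRESSED": "",
    "SNAPPY": ".snappy",
    "GZIP": ".gz",
    "ZSTD": ".zstd",
}


def parquet_file_extension(codec: str) -> str:
    """Hypothetical helper: the codec's extension followed by '.parquet'."""
    return CODEC_EXTENSIONS[codec.upper()] + ".parquet"


def part_file_name(prefix: str, codec: str) -> str:
    """Hypothetical helper: build a part-file name from a prefix and a codec."""
    return prefix + parquet_file_extension(codec)


if __name__ == "__main__":
    # Reuses the part-file prefix from the listing above.
    print(part_file_name(
        "part-00000-17277a00-8e7f-4797-b0b6-6a30ded6617e-c000", "zstd"))
```

Because the codec side owns the extension, the writer never hard-codes `.zstd` itself; it only appends the format suffix.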
   
   Also, cc @viirya .




