dongjoon-hyun commented on pull request #32051: URL: https://github.com/apache/spark/pull/32051#issuecomment-813096687
@MaxGekk, it's for consistency within Apache Spark (and with Apache Parquet).

If you are asking about our history, Apache Spark has been using `zstd` for event log files since Apache Spark 2.3.0:

```
$ ls -al *zstd
-rw-rw----  1 dongjoon  staff   33101 Apr  4 13:30 local-1617568193664.zstd
-rw-r--r--@ 1 dongjoon  staff  856181 Nov  5 01:16 spark-6248534cdfc14ae698cddd2701ab61b3.zstd
```

Apache Parquet has been using `.zstd` since Apache Parquet 1.10.0 (PARQUET-1143).
- https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/CompressionCodecName.java#L33

As a result, the Parquet ZSTD files that Apache Spark generates already use `.zstd` (previously, on ZSTD-enabled Hadoop), because we delegate the extension name to Apache Parquet. From Apache Spark 3.2.0, we can use these file names in all environments.

```
$ ls -al /tmp/p_zstd
total 24
drwxr-xr-x   6 dongjoon  wheel  192 Apr  4 13:27 .
drwxrwxrwt  16 root      wheel  512 Apr  4 13:27 ..
-rw-r--r--   1 dongjoon  wheel    8 Apr  4 13:27 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Apr  4 13:27 .part-00000-17277a00-8e7f-4797-b0b6-6a30ded6617e-c000.zstd.parquet.crc
-rw-r--r--   1 dongjoon  wheel    0 Apr  4 13:27 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  511 Apr  4 13:27 part-00000-17277a00-8e7f-4797-b0b6-6a30ded6617e-c000.zstd.parquet
```

Also, cc @viirya.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
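
For readers following the comment above, here is a minimal sketch of how a file name such as `part-...-c000.zstd.parquet` can come about when the codec's extension is delegated to a lookup table. This is an illustrative model, not Spark's or Parquet's actual code: the `CODEC_EXTENSIONS` table and the `parquet_file_name` helper are hypothetical names, and the extension values are assumed to mirror the `CompressionCodecName` enum linked in the comment.

```python
# Illustrative model only; not the real Spark or Parquet implementation.
# The extension values below are assumed to match Parquet's
# CompressionCodecName enum (hypothetical table for this sketch).
CODEC_EXTENSIONS = {
    "UNCOMPRESSED": "",
    "SNAPPY": ".snappy",
    "GZIP": ".gz",
    "ZSTD": ".zstd",
}


def parquet_file_name(base: str, codec: str) -> str:
    """Append the codec's extension, then the format suffix (hypothetical helper)."""
    return base + CODEC_EXTENSIONS[codec.upper()] + ".parquet"


if __name__ == "__main__":
    # Base name taken from the directory listing in the comment.
    base = "part-00000-17277a00-8e7f-4797-b0b6-6a30ded6617e-c000"
    print(parquet_file_name(base, "zstd"))
```

Under these assumptions, the writer never hard-codes `.zstd` itself; changing the extension in the codec table would change every generated file name, which is why staying consistent with Parquet's naming matters.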
