c21 commented on code in PR #37263:
URL: https://github.com/apache/spark/pull/37263#discussion_r928229102
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -922,6 +922,22 @@ object SQLConf {
.checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", "lz4",
"brotli", "zstd"))
.createWithDefault("snappy")
+  val PARQUET_COMPRESSION_ZSTD_LEVEL = buildConf("spark.sql.parquet.zstd.level")
+    .doc("Sets the zstd compression level when writing Parquet files and the compression " +
+      "codec is `zstd`. The valid range is 1~22. Generally, a higher compression level " +
+      "achieves a higher compression ratio, but writing takes longer.")
+    .version("3.4.0")
+    .intConf
+    .createWithDefault(3)
Review Comment:
Curious how we arrived at level 3 as the default. Is there any data or benchmark backing
the choice?
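For context on the tradeoff the doc string describes (higher level, smaller output, slower write), the direction of the effect can be illustrated with the JDK's built-in `java.util.zip.Deflater`, whose 1~9 level knob is analogous to zstd's 1~22. This is a rough analogy only, not Spark, Parquet, or zstd code, and the exact sizes depend on the input:

```scala
import java.util.zip.Deflater

// Compress `data` at the given DEFLATE level (1 = fastest, 9 = best ratio)
// and return the compressed size in bytes.
def compressedSize(data: Array[Byte], level: Int): Int = {
  val deflater = new Deflater(level)
  deflater.setInput(data)
  deflater.finish()
  val buf = new Array[Byte](4096)
  var total = 0
  while (!deflater.finished()) {
    total += deflater.deflate(buf)
  }
  deflater.end()
  total
}

// Repetitive input compresses well, which makes the level difference visible.
val sample = ("spark parquet zstd level tradeoff " * 500).getBytes("UTF-8")
val fastSize = compressedSize(sample, 1) // lowest level: fastest, larger output
val bestSize = compressedSize(sample, 9) // highest level: slowest, smallest output
```

On typical compressible data, `bestSize` is at most `fastSize`; a real benchmark for picking the zstd default would measure both size and write time on representative Parquet workloads.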
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -922,6 +922,22 @@ object SQLConf {
.checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", "lz4",
"brotli", "zstd"))
.createWithDefault("snappy")
+  val PARQUET_COMPRESSION_ZSTD_LEVEL = buildConf("spark.sql.parquet.zstd.level")
+    .doc("Sets the zstd compression level when writing Parquet files and the compression " +
+      "codec is `zstd`. The valid range is 1~22. Generally, a higher compression level " +
+      "achieves a higher compression ratio, but writing takes longer.")
+    .version("3.4.0")
+    .intConf
+    .createWithDefault(3)
+
+  val PARQUET_COMPRESSION_ZSTD_WORKERS = buildConf("spark.sql.parquet.zstd.workers")
+    .doc("Sets the number of zstd worker threads spawned to compress in parallel when " +
+      "writing Parquet files and the compression codec is `zstd`. More workers improve " +
+      "speed but increase memory usage. When it is 0, compression runs in " +
+      "single-threaded mode.")
+    .version("3.4.0")
+    .intConf
+    .createWithDefault(0)
Review Comment:
Curious, is this the default value used by Parquet?
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -922,6 +922,22 @@ object SQLConf {
.checkValues(Set("none", "uncompressed", "snappy", "gzip", "lzo", "lz4",
"brotli", "zstd"))
.createWithDefault("snappy")
+  val PARQUET_COMPRESSION_ZSTD_LEVEL = buildConf("spark.sql.parquet.zstd.level")
Review Comment:
what about the existing config `spark.io.compression.zstd.level`? Can't we
just reuse the existing config instead of introducing new ones for each file
format?
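One way to reconcile the reviewer's suggestion with a format-specific knob is a fallback chain: the Parquet-specific key wins if set, otherwise the existing `spark.io.compression.zstd.level` applies, otherwise the default of 3 proposed in this diff. A minimal sketch of that resolution order, with a plain `Map` standing in for `SQLConf` (the fallback behavior itself is hypothetical, not what the PR implements):

```scala
// Hypothetical resolution order: format-specific key first, then the generic
// spark.io.compression.zstd.level, then the diff's proposed default of 3.
def effectiveZstdLevel(conf: Map[String, String]): Int =
  conf.get("spark.sql.parquet.zstd.level")
    .orElse(conf.get("spark.io.compression.zstd.level"))
    .map(_.toInt)
    .getOrElse(3)
```

Spark's `ConfigBuilder` supports this pattern natively via `fallbackConf`, which is how several existing SQL configs defer to a core config.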
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]