[
https://issues.apache.org/jira/browse/SPARK-53876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18033870#comment-18033870
]
Dongjoon Hyun commented on SPARK-53876:
---------------------------------------
The Apache Spark community has a policy on `Fix Version` and `Target
Version`, quoted below. Please don't set these fields when you file a JIRA issue.
https://spark.apache.org/contributing.html
{quote}Do not set the following fields:
- Fix Version. This is assigned by committers only when resolved.
- Target Version. This is assigned by committers to indicate a PR has been
accepted for possible fix by the target version.{quote}
> Addition of column-level Parquet compression preference in Spark
> ----------------------------------------------------------------
>
> Key: SPARK-53876
> URL: https://issues.apache.org/jira/browse/SPARK-53876
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Submit, SQL
> Affects Versions: 4.1.0
> Environment: Spark Version: 4.1.0 (open-source)
> Deployment: Spark on Kubernetes (GKE)
> Language: PySpark + Scala
> Delta Lake: 3.0.0
> OS: Ubuntu 22.04
> Java: OpenJDK 17 (Zulu)
> Cluster: GKE N2D (AMD EPYC), 8 vCPU / 32 GB per executor
> Reporter: Prajwal H G
> Priority: Major
> Labels: compression
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> h4. *Problem*
> Apache Spark currently allows only *global compression configuration* for
> Parquet files using:
> {{spark.sql.parquet.compression.codec = snappy | gzip | zstd | uncompressed}}
> However, many production datasets contain heterogeneous columns — for example:
> * text or categorical columns that compress better with {*}ZSTD{*},
> * numeric columns that perform better with {*}SNAPPY{*}.
> Today, Spark applies a single codec to the entire file, preventing users from
> optimizing storage and I/O performance per column.
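> For reference, a minimal PySpark illustration of the current file-wide behavior (the paths are placeholders):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(1000).selectExpr("id", "cast(id as string) as comment")
>
> # Session-level codec: applies to every column of every Parquet file written.
> spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
> df.write.mode("overwrite").parquet("/tmp/zstd_table")
>
> # Per-write override via the writer option -- still file-wide, not per-column.
> df.write.option("compression", "snappy").mode("overwrite").parquet("/tmp/snappy_table")
> {code}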
> h4. Proposed Improvement
> Introduce a new configuration key to define *per-column compression codecs*
> in a map format:
> {{spark.sql.parquet.column.compression.map = colA:zstd,colB:snappy,colC:gzip}}
> *Behavior:*
> * The global codec ({{{}spark.sql.parquet.compression.codec{}}}) remains the
> default for all columns.
> * Any column listed in {{spark.sql.parquet.column.compression.map}} will use
> its specified codec.
> * Unspecified columns continue to use the global codec (a resolution sketch follows below).
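> A rough sketch of how this resolution could work; the conf key is the one proposed above, and {{parse_column_codec_map}} / {{resolve_codec}} are hypothetical helpers used only to illustrate the intended semantics, not an existing API:
> {code:python}
> def parse_column_codec_map(conf_value: str) -> dict:
>     """Parse 'colA:zstd,colB:snappy' into {'colA': 'zstd', 'colB': 'snappy'}."""
>     pairs = (entry.split(":", 1) for entry in conf_value.split(",") if entry.strip())
>     return {col.strip(): codec.strip().lower() for col, codec in pairs}
>
> def resolve_codec(column: str, column_map: dict, global_codec: str) -> str:
>     # Columns listed in the map use their own codec; everything else falls
>     # back to spark.sql.parquet.compression.codec.
>     return column_map.get(column, global_codec)
>
> column_map = parse_column_codec_map("country:zstd,price:snappy,comment:gzip")
> assert resolve_codec("country", column_map, "snappy") == "zstd"
> assert resolve_codec("order_id", column_map, "snappy") == "snappy"
> {code}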
> *Example:*
> {{--conf spark.sql.parquet.compression.codec=snappy \}}
> {{--conf spark.sql.parquet.column.compression.map="country:zstd,price:snappy,comment:gzip"}}
> *Effect:*
> ||Column||Codec||
> |country|zstd|
> |price|snappy|
> |comment|gzip|
> |all others|snappy (global default)|
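> If implemented, the effect could be verified by inspecting the column-chunk metadata of the written files, e.g. with PyArrow (the part-file path is a placeholder):
> {code:python}
> import pyarrow.parquet as pq
>
> # Print the codec actually recorded for each column chunk in the first row group.
> meta = pq.ParquetFile("/tmp/mixed_table/part-00000.parquet").metadata
> for i in range(meta.num_columns):
>     col = meta.row_group(0).column(i)
>     print(col.path_in_schema, col.compression)
> {code}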
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]