[ https://issues.apache.org/jira/browse/SPARK-53876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18033870#comment-18033870 ]

Dongjoon Hyun commented on SPARK-53876:
---------------------------------------

The Apache Spark community has a policy governing the `Fix Version` and `Target 
Version` fields, quoted below. Please don't set them when you file a JIRA issue.
https://spark.apache.org/contributing.html
{quote}Do not set the following fields:
- Fix Version. This is assigned by committers only when resolved.
- Target Version. This is assigned by committers to indicate a PR has been 
accepted for possible fix by the target version.{quote}

> Addition of column-level Parquet compression preference in Spark
> ----------------------------------------------------------------
>
>                 Key: SPARK-53876
>                 URL: https://issues.apache.org/jira/browse/SPARK-53876
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Submit, SQL
>    Affects Versions: 4.1.0
>         Environment: Spark Version: 4.1.0 (open-source)
> Deployment: Spark on Kubernetes (GKE)
> Language: PySpark + Scala
> Delta Lake: 3.0.0
> OS: Ubuntu 22.04
> Java: OpenJDK 17 (Zulu)
> Cluster: GKE N2D (AMD EPYC), 8 vCPU / 32 GB per executor
>            Reporter: Prajwal H G
>            Priority: Major
>              Labels: compression
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h4. *Problem*
> Apache Spark currently allows only *global compression configuration* for 
> Parquet files using:
> {{spark.sql.parquet.compression.codec = snappy | gzip | zstd | uncompressed}}
> However, many production datasets contain heterogeneous columns — for example:
>  * text or categorical columns that compress better with {*}ZSTD{*},
>  * numeric columns that perform better with {*}SNAPPY{*}.
> Today, Spark applies a single codec to the entire file, preventing users from 
> optimizing storage and I/O performance per column.
> h4. Proposed Improvement
> Introduce a new configuration key to define *per-column compression codecs* 
> in a map format:
> {{spark.sql.parquet.column.compression.map = colA:zstd,colB:snappy,colC:gzip}}
> *Behavior:*
>  * The global codec ({{spark.sql.parquet.compression.codec}}) remains the 
> default for all columns.
>  * Any column listed in {{spark.sql.parquet.column.compression.map}} will use 
> its specified codec.
>  * Unspecified columns continue to use the global codec.
> *Example:*
> {{--conf spark.sql.parquet.compression.codec=snappy \}}
> {{--conf spark.sql.parquet.column.compression.map="country:zstd,price:snappy,comment:gzip"}}
> *Effect:*
> ||Column||Codec||
> |country|zstd|
> |price|snappy|
> |comment|gzip|
> |all others|snappy (global default)|
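The precedence rules in the proposal could be sketched as follows. This is a hypothetical illustration only: the config key {{spark.sql.parquet.column.compression.map}} and its {{col:codec}} format come from the proposal above and do not exist in Spark today; the function names are made up for this sketch.

```python
# Hypothetical sketch of the proposed per-column codec resolution.
# Neither the config key nor these helpers exist in Spark; they only
# illustrate the parsing and fallback semantics described in the proposal.

def parse_column_codec_map(conf_value):
    """Parse 'colA:zstd,colB:snappy' into {'colA': 'zstd', 'colB': 'snappy'}."""
    mapping = {}
    for entry in conf_value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        column, _, codec = entry.partition(":")
        mapping[column.strip()] = codec.strip().lower()
    return mapping

def codec_for_column(column, column_map, global_codec="snappy"):
    """Columns listed in the map use their codec; all others fall back
    to the global spark.sql.parquet.compression.codec setting."""
    return column_map.get(column, global_codec)

column_map = parse_column_codec_map("country:zstd,price:snappy,comment:gzip")
print(codec_for_column("country", column_map))   # zstd (from the map)
print(codec_for_column("quantity", column_map))  # snappy (global default)
```

Under these semantics a misspelled or absent column name silently falls back to the global codec, so the real implementation would likely also need validation against the write schema.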



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
