[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408324#comment-17408324 ]

Chao Sun commented on SPARK-34276:
----------------------------------

I did some study of the code and it seems this only affects Spark when 
{{spark.sql.hive.convertMetastoreParquet}} is set to false, as [~nemon] pointed 
out above. By default Spark uses {{filterFileMetaDataByMidpoint}} (see 
[here|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1226]),
 which is not much impacted by this bug. In the worst case it could cause 
imbalance when assigning Parquet row groups to Spark tasks, but nothing like 
read errors or data loss.
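
To make that last point concrete, below is a minimal sketch (not the actual parquet-mr code; the class, record, and method names are illustrative) of the midpoint rule that {{filterFileMetaDataByMidpoint}} applies: a row group is assigned to a split only if the midpoint of its byte range falls inside the split. Since every midpoint lands in exactly one split, a skewed offset can at worst move a row group to a neighboring task (imbalance), never drop it or read it twice.

{code:java}
// Sketch of midpoint-based row-group assignment (illustrative names only).
final class MidpointFilterSketch {

    /** Hypothetical stand-in for a row group's block metadata. */
    record RowGroup(long startingPos, long compressedSize) {}

    /** A row group belongs to a split iff its midpoint lies in [splitStart, splitEnd). */
    static boolean assignedToSplit(RowGroup rg, long splitStart, long splitEnd) {
        long mid = rg.startingPos() + rg.compressedSize() / 2;
        return mid >= splitStart && mid < splitEnd;
    }

    public static void main(String[] args) {
        // Two splits covering a 256 MB file; a row group straddling the 128 MB
        // boundary goes to whichever split contains its midpoint.
        RowGroup straddling = new RowGroup(120L << 20, 32L << 20); // bytes 120 MB..152 MB
        System.out.println(assignedToSplit(straddling, 0, 128L << 20));          // false
        System.out.println(assignedToSplit(straddling, 128L << 20, 256L << 20)); // true
    }
}
{code}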

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> ------------------------------------------------------------------
>
>                 Key: SPARK-34276
>                 URL: https://issues.apache.org/jira/browse/SPARK-34276
>             Project: Spark
>          Issue Type: Task
>          Components: Build, SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuming Wang
>            Priority: Blocker
>
> Before the release, we need to double-check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should 
> upgrade/revert Parquet. At the same time, we should encourage the whole 
> community to run compatibility and performance tests on their production 
> workloads, covering both the read and write code paths.
> More details: 
> [https://github.com/apache/spark/pull/26804#issuecomment-768790620]


