[
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin Huai updated SPARK-10143:
-----------------------------
Component/s: SQL
> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
> Key: SPARK-10143
> URL: https://issues.apache.org/jira/browse/SPARK-10143
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Priority: Critical
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it
> needs to be enabled to handle tables with many files), Parquet delegates
> the calculation of initial splits to FileInputFormat (see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users do
> not set a min split size, the initial split list will contain lots of
> dummy splits, and these produce empty tasks (because the byte range of such a
> split does not cover the starting point of any row group).
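
[Editorial sketch, not part of the original report.] The mechanism described above can be illustrated with a small, self-contained simulation. The split-chopping and row-group-assignment functions below are hypothetical stand-ins for FileInputFormat's split calculation and Parquet's row-group ownership rule (a row group is read by the split whose range contains its starting offset); the sizes are made up for illustration:

```python
# Hypothetical sketch of FileInputFormat-style splitting vs. Parquet
# row group offsets; not actual parquet-mr or Hadoop code.

def compute_splits(file_size, split_size):
    """Mimic FileInputFormat: chop the file into split_size-byte chunks."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

def assign_row_groups(splits, row_group_offsets):
    """A row group is processed by the split whose range covers its start."""
    assignment = {s: [] for s in splits}
    for rg in row_group_offsets:
        for start, length in splits:
            if start <= rg < start + length:
                assignment[(start, length)].append(rg)
                break
    return assignment

MB = 1024 * 1024
file_size = 256 * MB       # one Parquet file (illustrative size)
row_group_size = 128 * MB  # row group size
block_size = 64 * MB       # filesystem block size < row group size

splits = compute_splits(file_size, block_size)
row_groups = list(range(0, file_size, row_group_size))  # starts at 0, 128MB
work = assign_row_groups(splits, row_groups)

empty = [s for s, rgs in work.items() if not rgs]
print(f"{len(splits)} splits, {len(empty)} produce empty tasks")
# -> 4 splits, 2 produce empty tasks
```

Here splits 2 and 4 contain no row group start, so the tasks scheduled for them read nothing; this is the empty-task effect the report describes, and raising the min split size toward the row group size would shrink the number of such dummy splits.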
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)