[https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707500#comment-14707500]
Yin Huai commented on SPARK-10143:
----------------------------------
Just a note about this change: if it reduces parallelism too much, users can
decrease {{parquet.block.size}} in the Hadoop conf, either directly or through
{{org.apache.spark.deploy.SparkHadoopUtil.get.conf.set("parquet.block.size",
"new value")}}, and/or set {{mapred.min.split.size}} to a lower number.
> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
> Key: SPARK-10143
> URL: https://issues.apache.org/jira/browse/SPARK-10143
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Assignee: Yin Huai
> Priority: Critical
> Fix For: 1.5.0
>
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it
> needs to be enabled to handle tables with many files), Parquet delegates
> the work of calculating the initial splits to FileInputFormat (see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users do
> not set a min split size, the initial split list will contain many dummy
> splits that turn into empty tasks (because the byte range of such a split
> does not cover the starting point of any row group).
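The effect described above can be sketched numerically. The {{max(minSize, min(maxSize, blockSize))}} formula below matches Hadoop's {{FileInputFormat.computeSplitSize}}; the file and block sizes are illustrative assumptions, not values from the issue:

```python
# Sketch of how FileInputFormat-style splits interact with Parquet row groups.
# The split-size formula mirrors Hadoop's FileInputFormat.computeSplitSize;
# all concrete sizes here are illustrative.
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    return max(min_size, min(max_size, block_size))

def splits(file_len, split_size):
    # (offset, length) pairs covering the file in split_size chunks.
    return [(off, min(split_size, file_len - off))
            for off in range(0, file_len, split_size)]

MB = 1024 * 1024
row_group = 128 * MB   # parquet.block.size (row group size)
fs_block = 32 * MB     # filesystem block size, smaller than the row group
file_len = 512 * MB

# Default min split size: each 128 MB row group is covered by four 32 MB
# splits, but only the split whose range contains a row group's starting
# offset does real work; the rest become empty tasks.
size = compute_split_size(fs_block)
all_splits = splits(file_len, size)
row_group_starts = range(0, file_len, row_group)
non_empty = [s for s in all_splits
             if any(s[0] <= start < s[0] + s[1] for start in row_group_starts)]
print(len(all_splits), len(non_empty))  # 16 splits, only 4 non-empty

# Raising the min split size to the row group size removes the dummy splits.
size2 = compute_split_size(fs_block, min_size=row_group)
print(len(splits(file_len, size2)))  # 4
```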
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)