[https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707500#comment-14707500]
Yin Huai commented on SPARK-10143:
----------------------------------
Just a note about this change: if it reduces parallelism too much, users can
decrease {{parquet.block.size}} in the Hadoop conf, either directly or through
{{org.apache.spark.deploy.SparkHadoopUtil.get.conf.set("parquet.block.size",
"new value")}}, and/or set {{mapred.min.split.size}} to a lower number.
> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
> Key: SPARK-10143
> URL: https://issues.apache.org/jira/browse/SPARK-10143
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Assignee: Yin Huai
> Priority: Critical
> Fix For: 1.5.0
>
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it
> needs to be enabled to handle tables with many files), Parquet delegates
> the work of calculating the initial splits to FileInputFormat (see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users do
> not set a min split size, the initial split list will contain many dummy
> splits that turn into empty tasks (because the byte range of such a split
> does not cover the starting point of any row group).
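The effect described above can be sketched numerically. The {{max(minSize, min(maxSize, blockSize))}} formula below matches Hadoop's {{FileInputFormat.computeSplitSize}}; the file and block sizes are illustrative assumptions, not values from the issue:

```python
# Sketch of how FileInputFormat-style splits interact with Parquet row groups.
# The split-size formula mirrors Hadoop's FileInputFormat.computeSplitSize;
# all concrete sizes here are illustrative.
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    return max(min_size, min(max_size, block_size))

def splits(file_len, split_size):
    # (offset, length) pairs covering the file in split_size chunks.
    return [(off, min(split_size, file_len - off))
            for off in range(0, file_len, split_size)]

MB = 1024 * 1024
row_group = 128 * MB   # parquet.block.size (row group size)
fs_block = 32 * MB     # filesystem block size, smaller than the row group
file_len = 512 * MB

# Default min split size: each 128 MB row group is covered by four 32 MB
# splits, but only the split whose range contains a row group's starting
# offset does real work; the rest become empty tasks.
size = compute_split_size(fs_block)
all_splits = splits(file_len, size)
row_group_starts = range(0, file_len, row_group)
non_empty = [s for s in all_splits
             if any(s[0] <= start < s[0] + s[1] for start in row_group_starts)]
print(len(all_splits), len(non_empty))  # 16 splits, only 4 non-empty

# Raising the min split size to the row group size removes the dummy splits.
size2 = compute_split_size(fs_block, min_size=row_group)
print(len(splits(file_len, size2)))  # 4
```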
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)