[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707112#comment-14707112 ]

Yin Huai commented on SPARK-10143:
----------------------------------

I ran a test yesterday that scans a table with 1824 files in S3 (file sizes 
range roughly from 80MB to 280MB, and the row group size is 128MB). To expose 
the overhead of those empty tasks, I did not read any column. My cluster had 
16 cores in total. Without changing the min split setting, my scan job got 
5023 tasks and took 102s on average to finish. After I changed the min split 
size to 400MB, I got one task per file and the job took 42s on average (btw, 
one task per file was the behavior I got when I used parquet 1.6.0rc3 in 
Spark 1.4). I will test https://github.com/apache/spark/pull/8346 with the 
same setting later.
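The effect of raising the min split size can be seen directly in FileInputFormat's split-size formula, `Math.max(minSize, Math.min(maxSize, blockSize))`. A minimal Python sketch of that arithmetic (the 64MB S3 block size is an assumption for illustration; the 400MB min split size is the value from the test above):

```python
def compute_split_size(block_size, min_size, max_size):
    # Mirrors FileInputFormat.computeSplitSize:
    # Math.max(minSize, Math.min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # default maxSize

# With the default min split size, splits follow the (assumed) 64MB block size,
# so a 128MB row group spans multiple splits and most splits are empty.
default_split = compute_split_size(64 * MB, 1, LONG_MAX)

# With minSize = 400MB (larger than every file in the 80MB-280MB range),
# each file becomes a single split, i.e. one task per file.
tuned_split = compute_split_size(64 * MB, 400 * MB, LONG_MAX)

print(default_split // MB)  # 64
print(tuned_split // MB)    # 400
```

Since 400MB exceeds the largest file (about 280MB), no file is ever divided, which matches the one-task-per-file behavior reported above.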

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task side metadata is enabled (by default it is enabled and it 
> needs to be enabled to deal with tables with many files), Parquet delegates 
> the work of calculating initial splits to FileInputFormat (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
>  If the filesystem's block size is smaller than the row group size and users 
> do not set a min split size, the initial split list will contain many dummy 
> splits, which turn into empty tasks (because the byte range of such a split 
> does not cover the starting point of any row group). 
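The mechanism described above can be sketched with a small simulation. A split only produces work if some row group *starts* inside its byte range; splits cut at a block size smaller than the row group size mostly miss. All concrete numbers here (a 200MB file, an assumed 64MB S3 block size, 128MB row groups) are illustrative, and the splitting logic ignores FileInputFormat's 1.1x slop factor:

```python
MB = 1024 * 1024

def make_splits(file_len, split_size):
    """Naive FileInputFormat-style splitting (ignores the 1.1x slop factor)."""
    out, off = [], 0
    while off < file_len:
        length = min(split_size, file_len - off)
        out.append((off, length))
        off += length
    return out

def working_splits(splits, row_group_starts):
    # Keep only splits whose range [start, start+len) covers a row group start.
    return [(s, l) for (s, l) in splits
            if any(s <= rg < s + l for rg in row_group_starts)]

file_len = 200 * MB                  # assumed example file
row_groups = [0, 128 * MB]           # 128MB row groups, as in the report

small = make_splits(file_len, 64 * MB)   # splits at the (assumed) block size
print(len(small), len(working_splits(small, row_groups)))  # 4 2 -> 2 empty tasks

big = make_splits(file_len, 400 * MB)    # min split size raised past the file size
print(len(big), len(working_splits(big, row_groups)))      # 1 1 -> no empty tasks
```

With block-size splits, half the tasks for this file are empty; with one split per file, every task reads at least one row group.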



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
