Yin Huai created SPARK-10143:
--------------------------------

             Summary: Parquet changed the behavior of calculating splits
                 Key: SPARK-10143
                 URL: https://issues.apache.org/jira/browse/SPARK-10143
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.5.0
            Reporter: Yin Huai
            Priority: Critical


When Parquet's task-side metadata is enabled (it is enabled by default, and it 
needs to be enabled to handle tables with many files), Parquet delegates the 
work of calculating the initial splits to FileInputFormat (see 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
If the filesystem's block size is smaller than the row group size and users do 
not set a min split size, the initial split list will contain many dummy splits 
that turn into empty tasks, because a split whose start and end offsets do not 
cover the starting offset of any row group reads no data.
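To make the arithmetic concrete, here is a minimal sketch (plain Python, not 
parquet-mr code; the function names are made up for illustration) of 
FileInputFormat-style splitting. With no min/max split size configured, the 
split size degenerates to the filesystem block size, and every split whose byte 
range contains no row-group starting offset becomes an empty task:

```python
def compute_splits(file_size, block_size, min_split_size=1):
    # FileInputFormat computes splitSize = max(minSize, min(maxSize, blockSize));
    # with no min/max split size set, this is just the block size.
    split_size = max(min_split_size, block_size)
    splits, start = [], 0
    while start < file_size:
        end = min(start + split_size, file_size)
        splits.append((start, end))
        start = end
    return splits

def empty_splits(splits, row_group_starts):
    # A row group is read by the split whose range covers the group's
    # starting offset; a split covering no such offset reads nothing.
    return [s for s in splits
            if not any(s[0] <= rg < s[1] for rg in row_group_starts)]

# Example: a 1 GB file with 128 MB row groups on a 32 MB-block filesystem.
MB = 1 << 20
file_size = 1024 * MB
row_group_starts = list(range(0, file_size, 128 * MB))  # 8 row groups
splits = compute_splits(file_size, 32 * MB)             # 32 splits
```

In this example only 8 of the 32 splits cover a row-group start, so 24 splits 
are dummy splits that schedule empty tasks. Setting the min split size to at 
least the row group size collapses the split list back to one split per block 
of row groups.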



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
