Yin Huai created SPARK-10143:
--------------------------------
Summary: Parquet changed the behavior of calculating splits
Key: SPARK-10143
URL: https://issues.apache.org/jira/browse/SPARK-10143
Project: Spark
Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical
When Parquet's task-side metadata is enabled (it is enabled by default, and it
needs to be enabled to handle tables with many files), Parquet delegates the
calculation of initial splits to FileInputFormat (see
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
If the filesystem's block size is smaller than the Parquet row group size and
the user does not set a min split size, the initial split list will contain many
dummy splits, and these turn into empty tasks (because such a split does not
cover the starting point of any row group).
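The effect described above can be sketched with a small simulation. This is not
the actual FileInputFormat or Parquet code; it only reproduces the standard
split-size rule max(minSize, min(maxSize, blockSize)) and counts splits that
contain no row-group start. The sizes (64 MB blocks, 256 MB row groups, 1 GB
file) are assumed for illustration.

```java
public class SplitSketch {
    // FileInputFormat's split-size rule: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Count splits that contain no row-group starting point. With task-side
    // metadata, a row group is read by the task whose split covers its start,
    // so every such split becomes an empty task.
    static int countEmptySplits(long fileLen, long splitSize, long rowGroupSize) {
        int empty = 0;
        for (long off = 0; off < fileLen; off += splitSize) {
            long end = Math.min(off + splitSize, fileLen);
            boolean hasRowGroupStart = false;
            for (long rg = 0; rg < fileLen; rg += rowGroupSize) {
                if (rg >= off && rg < end) { hasRowGroupStart = true; break; }
            }
            if (!hasRowGroupStart) empty++;
        }
        return empty;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;    // filesystem block size (assumed)
        long rowGroup  = 256 * mb;   // Parquet row group size (assumed)
        long fileLen   = 1024 * mb;  // one file with 4 row groups

        // Default: no min split size, so splits are block-sized (64 MB).
        long splitSize = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        // 16 splits, only 4 contain a row-group start -> 12 empty tasks.
        System.out.println("empty splits: "
                + countEmptySplits(fileLen, splitSize, rowGroup));

        // Raising min split size to the row group size removes the empty splits.
        long fixedSize = computeSplitSize(blockSize, rowGroup, Long.MAX_VALUE);
        System.out.println("empty splits after raising min split size: "
                + countEmptySplits(fileLen, fixedSize, rowGroup));
    }
}
```

Under these numbers, the default settings produce 12 empty splits out of 16,
which is the dummy-split/empty-task behavior the issue describes.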