[
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707148#comment-14707148
]
Ryan Blue commented on SPARK-10143:
-----------------------------------
[~yhuai] if you do that, you will get the current value for the configuration,
not what was used to write the file. If you want to know what the value was
when the file was written, you have to read its footer.
As for the challenge of S3 input splits: if you're running against S3, why not
split the files based on total length? Example:
* 2 files: 500 MB and 700 MB
* Want 5 reducers
* Splits: file 1: 0-250 MB, file 1: 250-500 MB, file 2: 0-250 MB,
file 2: 250-500 MB, file 2: 500-700 MB
Even without knowing the block size, you can control parallelism. If there are
lots of small blocks (say, a 64 MB block size), you get approximately what you
wanted. If there are big blocks (256 MB), you are still okay. If you have
gigantic blocks (500 MB), you waste a couple of tasks but still get as much
parallelism as possible.
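
A minimal sketch of the length-based splitting described above, in plain
Python rather than Spark or Parquet code. The function name plan_splits and
its parameters are hypothetical, and the 250 MB split size is simply the one
used in the example. Splits never cross file boundaries, so the last split of
each file may be shorter than the rest.

```python
from typing import List, Tuple

MB = 1024 * 1024


def plan_splits(file_lengths: List[int], split_size: int) -> List[Tuple[int, int, int]]:
    """Cut each file into byte ranges of at most split_size bytes.

    Returns (file_index, start, end) tuples with end exclusive. Only the
    file lengths are needed; the filesystem block size is never consulted.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    for index, length in enumerate(file_lengths):
        start = 0
        while start < length:
            end = min(start + split_size, length)
            splits.append((index, start, end))
            start = end
    return splits


if __name__ == "__main__":
    # The example from this comment: files of 500 MB and 700 MB, 250 MB splits.
    for index, start, end in plan_splits([500 * MB, 700 * MB], 250 * MB):
        print(f"file {index + 1}: {start // MB}-{end // MB} MB")
```

Note that deriving the split size as total length divided by the desired task
count (1200 MB / 5 = 240 MB here) would produce six splits instead of five,
because each file leaves its own remainder.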
> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
> Key: SPARK-10143
> URL: https://issues.apache.org/jira/browse/SPARK-10143
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Priority: Critical
>
> When Parquet's task side metadata is enabled (by default it is enabled and it
> needs to be enabled to deal with tables with many files), Parquet delegates
> the work of calculating initial splits to FileInputFormat (see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users
> do not set a min split size, the initial split list will contain lots of
> dummy splits, which become empty tasks (because the range between the
> starting point and ending point of such a split does not cover the starting
> point of any row group).
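
To illustrate the quoted description, here is a small sketch that counts the
dummy splits. It assumes, as the description states, that a split does useful
work only if it covers the starting point of some row group. The function
name, the file size, and the row group offsets are hypothetical; this is not
the parquet-mr or FileInputFormat logic itself.

```python
from typing import List, Tuple

MB = 1024 * 1024


def count_empty_splits(file_length: int, block_size: int,
                       row_group_starts: List[int]) -> Tuple[int, int]:
    """Return (total_splits, empty_splits) when splitting on block boundaries.

    A split [start, end) counts as empty when no row group starts inside it.
    """
    total = 0
    empty = 0
    for start in range(0, file_length, block_size):
        end = min(start + block_size, file_length)
        total += 1
        if not any(start <= offset < end for offset in row_group_starts):
            empty += 1
    return total, empty


if __name__ == "__main__":
    # Hypothetical 1 GB file with two 512 MB row groups and 64 MB blocks.
    total, empty = count_empty_splits(1024 * MB, 64 * MB, [0, 512 * MB])
    print(f"{empty} of {total} splits are empty")
```

With these made-up numbers, only the two splits that contain a row group
start do any work; the other fourteen become empty tasks.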