GitHub user dhruve opened a pull request:
https://github.com/apache/spark/pull/21601
[SPARK-24610] fix reading small files via wholeTextFiles
## What changes were proposed in this pull request?
The `WholeTextFileInputFormat` determines the `maxSplitSize` for the file(s)
being read using the `wholeTextFiles` method. While this works well for large
files, it fails for small files when the computed `maxSplitSize` is smaller
than the minimum split sizes picked up from configs such as hive-site.xml or
passed explicitly via
`mapreduce.input.fileinputformat.split.minsize.per.node` or
`mapreduce.input.fileinputformat.split.minsize.per.rack`. In that case it
throws an exception:
```java
java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 9962
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
  ... 48 elided
```
This change checks `maxSplitSize` against `minSplitSizePerNode` and
`minSplitSizePerRack` and sets them if `maxSplitSize < minSplitSizePerNode/Rack`.
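
The check can be illustrated with a small standalone sketch. It is not the code in this patch: the class and method names are hypothetical, it does not touch any Hadoop or Spark API, and it assumes that a configured minimum larger than `maxSplitSize` is lowered to `maxSplitSize`. The sample values are the ones from the stack trace above.

```java
// Hypothetical illustration only; names are not taken from the patch.
final class SplitSizeSketch {

    /**
     * Returns the minimum split size to use for a node or rack.
     * Assumption: when the configured minimum exceeds maxSplitSize,
     * it is lowered to maxSplitSize so the invariant min <= max holds.
     */
    static long effectiveMinSplitSize(long maxSplitSize, long configuredMin) {
        if (maxSplitSize < configuredMin) {
            return maxSplitSize;
        }
        return configuredMin;
    }

    public static void main(String[] args) {
        long maxSplitSize = 9962L;          // computed for a small input
        long minSplitSizePerNode = 123456L; // e.g. picked up from hive-site.xml

        long perNode = effectiveMinSplitSize(maxSplitSize, minSplitSizePerNode);
        System.out.println("per-node minimum used: " + perNode);
    }
}
```

With these inputs the per-node minimum no longer exceeds the maximum split size, which is the condition that triggered the `IOException`. A configured minimum that is already at or below `maxSplitSize` is left unchanged.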
## How was this patch tested?
Tested manually by setting the conf while launching the job, and added a unit test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dhruve/spark bug/SPARK-24610
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21601.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21601
----
commit 2369e3acee730b7d4e45175870de0ecac601069b
Author: Dhruve Ashar <dhruveashar@...>
Date: 2018-06-20T16:34:36Z
[SPARK-24610] fix reading small files via wholeTextFiles
----