GitHub user dhruve opened a pull request:

    https://github.com/apache/spark/pull/21601

    [SPARK-24610] fix reading small files via wholeTextFiles

    ## What changes were proposed in this pull request?
    `WholeTextFileInputFormat` determines the `maxSplitSize` for the file(s)
    being read via the `wholeTextFiles` method. This works well for large
    files. For small files, however, the computed `maxSplitSize` can be smaller
    than the minimum split sizes configured through
    `mapreduce.input.fileinputformat.split.minsize.per.node` or
    `mapreduce.input.fileinputformat.split.minsize.per.rack` (picked up from
    configs such as hive-site.xml or passed explicitly), and the read fails
    with an exception:
    
    
    ```java
    java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 9962
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
    ... 48 elided
    ```
    
    This change checks `maxSplitSize` against `minSplitSizePerNode` and
    `minSplitSizePerRack`, and overrides them if
    `maxSplitSize < minSplitSizePerNode/Rack`.
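
As a rough illustration of the check described above, the following Java sketch models the failing validation and the proposed adjustment with plain `long` values. It does not use the Hadoop or Spark classes: `SplitSizeSketch`, `SplitSizes`, `validate`, and `clamp` are hypothetical names, the rack error message is a stand-in, and capping the minimums at `maxSplitSize` is an assumption about what the change sets them to. The numbers come from the stack trace above.

```java
import java.io.IOException;

final class SplitSizeSketch {

    /** Hypothetical holder for the three split-size settings discussed above. */
    static final class SplitSizes {
        final long maxSplitSize;
        final long minSplitSizePerNode;
        final long minSplitSizePerRack;

        SplitSizes(long maxSplitSize, long minSplitSizePerNode, long minSplitSizePerRack) {
            this.maxSplitSize = maxSplitSize;
            this.minSplitSizePerNode = minSplitSizePerNode;
            this.minSplitSizePerRack = minSplitSizePerRack;
        }
    }

    /** Simplified stand-in for the validation that fails in the trace (0 means "not set"). */
    static void validate(SplitSizes s) throws IOException {
        if (s.maxSplitSize != 0 && s.minSplitSizePerNode > s.maxSplitSize) {
            throw new IOException("Minimum split size pernode " + s.minSplitSizePerNode
                    + " cannot be larger than maximum split size " + s.maxSplitSize);
        }
        if (s.maxSplitSize != 0 && s.minSplitSizePerRack > s.maxSplitSize) {
            throw new IOException("Minimum split size per rack " + s.minSplitSizePerRack
                    + " cannot be larger than maximum split size " + s.maxSplitSize);
        }
    }

    /** Assumed adjustment: lower each minimum to maxSplitSize when it exceeds it. */
    static SplitSizes clamp(SplitSizes s) {
        return new SplitSizes(
                s.maxSplitSize,
                Math.min(s.minSplitSizePerNode, s.maxSplitSize),
                Math.min(s.minSplitSizePerRack, s.maxSplitSize));
    }

    public static void main(String[] args) throws IOException {
        // Values from the stack trace: max split size 9962, per-node minimum 123456.
        SplitSizes configured = new SplitSizes(9962L, 123456L, 0L);
        SplitSizes adjusted = clamp(configured);
        validate(adjusted); // passes once the per-node minimum has been capped
        System.out.println("minSplitSizePerNode after clamp: " + adjusted.minSplitSizePerNode);
    }
}
```

Calling `validate(configured)` directly, without `clamp`, raises the same kind of `IOException` as in the trace, which is the case the change is meant to avoid for small inputs.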
    
    ## How was this patch tested?
    Tested manually by setting the conf while launching the job, and added a
    unit test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhruve/spark bug/SPARK-24610

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21601
    
----
commit 2369e3acee730b7d4e45175870de0ecac601069b
Author: Dhruve Ashar <dhruveashar@...>
Date:   2018-06-20T16:34:36Z

    [SPARK-24610] fix reading small files via wholeTextFiles

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
