GitHub user 10110346 opened a pull request:
https://github.com/apache/spark/pull/22725
[SPARK-24610][CORE][FOLLOW-UP] Fix reading small files via BinaryFileRDD
## What changes were proposed in this pull request?
This is a follow-up of #21601. `StreamFileInputFormat` and
`WholeTextFileInputFormat` have the same problem: when the computed maximum
split size is smaller than the configured minimum split size per node (or per
rack), `CombineFileInputFormat.getSplits` fails:

```
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
```
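The same clamping approach as #21601 can presumably be carried over. A minimal sketch (not the actual patch; the class name and record-reader stub are hypothetical, while the setters and config keys come from Hadoop's `CombineFileInputFormat` API) of lowering the per-node/per-rack minimums before Hadoop validates them:

```scala
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat

import scala.collection.JavaConverters._

// Hypothetical subclass for illustration only.
class SmallFilesInputFormat extends CombineFileInputFormat[String, String] {

  // Stubbed out; a real input format would return a working reader here.
  override def createRecordReader(
      split: InputSplit, context: TaskAttemptContext): RecordReader[String, String] = null

  def setMinPartitions(context: JobContext, minPartitions: Int): Unit = {
    val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
    val maxSplitSize = math.ceil(totalLen * 1.0 / math.max(minPartitions, 1)).toLong

    val conf = context.getConfiguration
    // If the configured minimums exceed maxSplitSize, getSplits throws the
    // IOException shown above, so clamp them down to maxSplitSize first.
    if (conf.getLong(CombineFileInputFormat.SPLIT_MINSIZE_PERNODE, 0L) > maxSplitSize) {
      super.setMinSplitSizeNode(maxSplitSize)
    }
    if (conf.getLong(CombineFileInputFormat.SPLIT_MINSIZE_PERRACK, 0L) > maxSplitSize) {
      super.setMinSplitSizeRack(maxSplitSize)
    }
    super.setMaxSplitSize(maxSplitSize)
  }
}
```

With the error above, `mapreduce.input.fileinputformat.split.minsize.per.node` is 5123456 while `maxSplitSize` is 4194304, so the per-node minimum would be clamped to 4194304 and `getSplits` would no longer throw.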
## How was this patch tested?
Added a unit test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/10110346/spark maxSplitSize_node_rack
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22725.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22725
----
commit 54ffcdb7a18471a7a24fe36a000ca0cc4e8d0eba
Author: liuxian <liu.xian3@...>
Date: 2018-10-15T07:28:31Z
fix
----
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]