[ https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-25753.
-----------------------------------
Resolution: Fixed
Fix Version/s: 3.0.0
> binaryFiles broken for small files
> ----------------------------------
>
> Key: SPARK-25753
> URL: https://issues.apache.org/jira/browse/SPARK-25753
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.0.0
> Reporter: liuxian
> Assignee: liuxian
> Priority: Minor
> Fix For: 3.0.0
>
>
> StreamFileInputFormat and WholeTextFileInputFormat
> (https://issues.apache.org/jira/browse/SPARK-24610) have the same problem:
> for small files, the maxSplitSize computed by StreamFileInputFormat is much
> smaller than the default or commonly used split size of 64/128 MB, and Spark
> throws an exception while trying to read them.
> Exception info:
> Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
> java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
>   at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
>   at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
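
To make the failure mode concrete, here is a minimal sketch in plain Python (no Spark or Hadoop required). It is an illustration under stated assumptions, not the actual Spark or Hadoop code: the function names, the default values, and the formula in estimate_max_split_size are hypothetical and only model the idea that many small files yield a small computed maximum split size. The check in check_split_sizes mirrors only what the error message above states, and treating 0 as "unset" is likewise an assumption.

```python
MIB = 1024 * 1024


def estimate_max_split_size(file_sizes, parallelism,
                            open_cost=4 * MIB, default_max=128 * MIB):
    """Hypothetical estimate of a maximum split size for a set of files.

    Assumption: each file is padded by a fixed open cost, the total is
    divided across the available parallelism, and the result is clamped
    between open_cost and default_max. This is an illustrative formula,
    not one taken from the report.
    """
    total_bytes = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total_bytes // parallelism
    return min(default_max, max(open_cost, bytes_per_core))


def check_split_sizes(min_split_size_per_node, max_split_size):
    """Raise if the per-node minimum exceeds the maximum split size.

    Mirrors the condition described by the exception message in the
    report. Assumption: a value of 0 means "not configured" and is skipped.
    """
    if (min_split_size_per_node != 0 and max_split_size != 0
            and min_split_size_per_node > max_split_size):
        raise IOError(
            "Minimum split size pernode %d cannot be larger than "
            "maximum split size %d" % (min_split_size_per_node, max_split_size)
        )


if __name__ == "__main__":
    # Ten 1 KiB files spread over 16 cores: the estimate bottoms out
    # at the open cost, far below a 64/128 MB split size.
    max_split = estimate_max_split_size([1024] * 10, parallelism=16)
    print("estimated max split size:", max_split)
    try:
        # 5123456 is the per-node minimum quoted in the report.
        check_split_sizes(5123456, max_split)
    except IOError as err:
        print("error:", err)
```

Under these assumptions, any per-node minimum configured above the estimated maximum trips the check, which matches the shape of the reported exception. One way to avoid it in this sketch is to pass 0 for the per-node minimum when it exceeds the computed maximum; whether the actual fix works that way is not stated in this message.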