[
https://issues.apache.org/jira/browse/SPARK-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960908#comment-13960908
]
Xusen Yin commented on SPARK-1415:
----------------------------------
Hi Matei, I just looked through the Hadoop APIs. I found that the new Hadoop
API deprecates minSplits; instead, it uses minSplitSize and maxSplitSize to
control splitting. minSplits is negatively correlated with maxSplitSize, so I
think we have two ways to fix the issue:
1. We provide a new API that takes maxSplitSize directly, say,
wholeTextFiles(path: String, maxSplitSize: Long);
2. We write a delegation that computes maxSplitSize from minSplits (easy to
write, taking the old Hadoop API as an example), and provide the API
wholeTextFiles(path: String, minSplits: Int);
We could also provide both APIs simultaneously. What do you think?
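To make option 2 concrete, here is a minimal sketch of how the delegation might compute maxSplitSize from minSplits: pick the largest per-split size that still yields at least minSplits splits over the total input length. The object and method names (SplitSizing, computeMaxSplitSize) are illustrative only, not Spark's or Hadoop's actual internals.

```scala
// Illustrative sketch: derive a maxSplitSize from an old-style minSplits hint.
object SplitSizing {
  /** Given the total byte length of all input files and a requested minimum
    * number of splits, return the largest per-split byte size that still
    * produces at least minSplits splits. Guards against minSplits <= 0. */
  def computeMaxSplitSize(totalLen: Long, minSplits: Int): Long =
    math.ceil(totalLen.toDouble / math.max(minSplits, 1)).toLong
}
```

With this helper, wholeTextFiles(path, minSplits) could simply forward computeMaxSplitSize(totalLen, minSplits) to the new API's maxSplitSize knob.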
> Add a minSplits parameter to wholeTextFiles
> -------------------------------------------
>
> Key: SPARK-1415
> URL: https://issues.apache.org/jira/browse/SPARK-1415
> Project: Spark
> Issue Type: Bug
> Reporter: Matei Zaharia
> Assignee: Xusen Yin
> Labels: Starter
>
> This probably requires adding one to newAPIHadoopFile too.
--
This message was sent by Atlassian JIRA
(v6.2#6252)