Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/4094#issuecomment-70436485
Yeah, this has always been broken. What's even more confusing is what
Hadoop actually does with this minSplits if you trace the code through - I
remember looking into it, and the logic on the Hadoop side is really
complicated. @idanz can you create a JIRA for this? Also, can you explain what
Hadoop is actually doing with this parameter? IIRC it's not as simple as it
appears - I've sketched my rough recollection below.
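To be clear, this is only a sketch from memory of the old
`org.apache.hadoop.mapred.FileInputFormat.getSplits`, not a quote of Hadoop's
source - but as I remember it, `numSplits` is just a hint used to derive a
goal size, which is then clamped by the block size and the configured minimum
split size:

```scala
// Rough sketch (from memory, not Hadoop's actual source) of how the old
// mapred FileInputFormat.getSplits uses numSplits: it only derives a
// per-split "goal" size from it, then clamps that by the HDFS block size
// and the configured minimum split size.
def approxSplitSize(totalSize: Long, numSplits: Int,
                    blockSize: Long, minSize: Long): Long = {
  val goalSize = totalSize / math.max(numSplits, 1)
  // splitSize = max(minSize, min(goalSize, blockSize))
  math.max(minSize, math.min(goalSize, blockSize))
}
```

So asking for N splits doesn't actually guarantee N splits - a large block
size or a configured minimum split size can override the hint entirely.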
An issue with changing this is that we could cause behavior to change in a
very unexpected way for Hadoop RDDs. Right now this is effectively a no-op
because it is almost always set to 2. I've only seen it affect things when
someone is reading a file in local mode that really could have been processed
with a single split.
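To make the no-op point concrete (assuming a SparkContext is already
available as `sc`):

```scala
// SparkContext caps the default at 2:
//   def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
// so on any cluster with >= 2 cores the value is always just 2; it only
// drops to 1 when running in local mode with a single core.
println(sc.defaultMinPartitions)
```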
If we change it, it could affect user applications a bunch. For instance, on
a large cluster it would actually cause all reads of Hadoop files to be split
over "# cores" tasks, even if there is only a small amount of data in the
file. That might not be desirable.
I wonder if we should just set it to 2 (i.e. hard-code it) and add a
note saying it's set this way for legacy reasons, and that users should
really pass in their own "minSplits" when creating a hadoopRDD if they want
to control the read splits - e.g. something like the example below.
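(Paths here are made up for illustration; `sc` is an existing SparkContext.)

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Pass the split hint explicitly instead of relying on the default:
val lines = sc.textFile("hdfs:///path/to/input", minPartitions = 64)

// Or with the lower-level hadoopFile API:
val records = sc.hadoopFile("hdfs:///path/to/input",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions = 64)
```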