Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/4094#issuecomment-70436485
  
    Yeah, this has always been broken. What's even more confusing is what 
Hadoop actually does with minSplits if you trace the code through: I remember 
looking at it, and the logic on the Hadoop side is quite complicated. @idanz, 
can you create a JIRA for this? Also, can you explain what Hadoop actually 
does with this parameter? IIRC it's not as simple as it appears.
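
    For reference, roughly what I recall it doing (sketched in Scala here; 
the real code is Java in the old-API FileInputFormat.getSplits), treating 
numSplits as only a hint:

        // numSplits sets a "goal" size per split, which Hadoop then clamps
        // between the configured minimum split size and the block size.
        def splitSize(totalSize: Long, numSplits: Int,
                      blockSize: Long, minSize: Long): Long = {
          val goalSize = totalSize / math.max(numSplits, 1)
          math.max(minSize, math.min(goalSize, blockSize))
        }

    With minSplits at 2, the goal size exceeds the block size for any real 
file, so the block size wins and the hint is a no-op; a larger value shrinks 
the goal size and forces more splits.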
    
    An issue with changing this is that we could cause behavior to change in 
a very unexpected way for Hadoop RDDs. Right now this is effectively a no-op 
because it is almost always set to 2. I've only seen it affect things when 
someone runs against a file in local mode that really could have been 
processed with a single split.
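
    The cap that makes it a no-op comes from how the default hint is derived; 
roughly (a sketch, but taking the min with 2 is the actual behavior):

        // The split hint passed to Hadoop is capped at 2, so on any real
        // cluster the default never exceeds 2.
        def defaultMinPartitions(defaultParallelism: Int): Int =
          math.min(defaultParallelism, 2)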
    
    If we change it, it could affect a lot of user applications. For 
instance, on a large cluster it would cause all reads of Hadoop files to be 
split over "# cores" tasks, even if there is only a small amount of data in 
the file. That might not be desirable.
    
    I wonder if we should just hard code it to 2 and add a note saying it's 
set this way for legacy reasons, and that users who want to control the read 
splits should pass in their own minSplits when creating a hadoopRDD, as in 
the sketch below.
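
    Something like this, i.e. pass the hint explicitly (hypothetical path 
and partition count, but minPartitions is the actual parameter name these 
days):

        import org.apache.spark.{SparkConf, SparkContext}

        val conf = new SparkConf().setAppName("splits").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // Pass the split hint explicitly rather than relying on the
        // default; it's forwarded to the InputFormat's getSplits.
        val lines = sc.textFile("hdfs:///path/to/input", minPartitions = 8)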
    


