Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    If Hadoop 2.5 vs. 2.6 behaves differently w.r.t. the S3 support classes, we can 
vary dependencies within the existing profile even, sure. That should be fixed 
up. However, I think we may be just about to drop 2.5 support anyway? That 
could simplify this.
    
    
    I get the idea of a small dependency-only module that includes a bunch of 
optional `hadoop-*` modules containing support code specific to 
Hadoop + cloud-specific APIs. Are these integration libraries something users 
could supply in their app? Maybe not. The module makes some sense, then, so 
people can build in cloud-specific SDK support if they want.
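    For illustration, such a dependency-only module's POM might do little more than declare the cloud connector artifacts. This is just a sketch; the exact artifact set and version property are assumptions, though `hadoop-aws` (which pulls in the S3A filesystem) and `hadoop-azure` are real Hadoop artifacts:

```xml
<!-- Sketch of a dependency-only module POM fragment.
     The version property and exact artifact list are illustrative. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-azure</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
```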
    
    Docs are pretty uncontroversial, especially cloud-specific notes about 
config params and how to set them. That seems helpful.
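    For example, the kind of cloud-specific config params such docs would cover, shown here as a `spark-defaults.conf` sketch (the credential values are placeholders; `spark.hadoop.*` properties are passed through to the Hadoop configuration):

```
# Sketch: passing S3A credentials through Spark to Hadoop.
# Replace the placeholder values with real credentials.
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
```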
    
    
    However, I also see a lot of cloud-specific tests and examples. I didn't 
expect that. Is there new, different functionality in Spark that only turns up 
in a cloud context? I see this is actually adding some new utility methods and 
new RDD-API-like methods like `saveAsTextFile()`. I thought this would just be 
about making it easy to get the Hadoop API machinery set up to access 
cloud-specific storage.
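    To be clear about what I mean by the Hadoop machinery: once the connector jar and config are in place, the existing RDD APIs already talk to cloud storage through Hadoop's `FileSystem` layer. A sketch, assuming a running `SparkContext` `sc`, `hadoop-aws` on the classpath, and placeholder bucket/paths:

```scala
// Sketch: existing Spark APIs reach cloud stores via Hadoop's FileSystem layer.
// "my-bucket" and the paths are placeholders; requires the S3A connector on the classpath.
val rdd = sc.textFile("s3a://my-bucket/input/*.txt")
rdd.map(_.toUpperCase)
   .saveAsTextFile("s3a://my-bucket/output/upper")
```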
    
    These tests couldn't be enabled on Spark Jenkins, right? At least, it would 
mean budgeting to run them and all that. If this is about making SDK 
integration easier, do we need specific tests? It seems to be more about 
testing the SDK and the cloud service than anything, and prone to false positives.
    
    Not that it isn't useful, just trying to figure out how to reduce this to 
something less massive, at least to start.
    
    
    I don't know the origin of the feature-branch comment -- is this referring 
to maintaining separate branches for major lines of development within Spark's 
primary git repo, and not just release branches? I actually don't quite like 
that. Downsides? Such branches become quasi-official when it's not clear they 
deserve that status more than others' collaborations. Enabling development off 
master for extended periods tends to let people do a lightweight fork and 
continue development without the forcing mechanism of getting review or buy-in 
early. This leads to long-running dead ends, or "too big to not upmerge" 
feature-branch battles. Or you get, well, forks. Wasn't this kind of how Hadoop 
ended up with a different "security" release branch a long time ago?
    
    The upside is collaborating on something that isn't master, but git makes 
that trivial now. Yes, the risk is that the collaboration is therefore not 
forcibly coupled to Spark's git repo, but all we really need is that any such 
repo is public, open, and shared on official channels. It's not like people 
can't collaborate, even privately, today, so this isn't a new thing.

