Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/12004
Here's why this matters, and why a simple "isn't this just a matter of
dropping in the JARs" isn't the solution:
*getting the right JARs together with the right Spark version is a non-trivial problem*.
That's essentially it.
1. Does everyone know which version of the AWS SDK is needed for Hadoop 2.7? (1.7.4.)
1. And whether that is compatible with the version in Hadoop 2.6? (Maybe.)
1. Will it be the same in Hadoop 2.8? (Yes.)
1. Will it be the same in Hadoop 2.9+? No: it moves to a 1.11.x release, where AWS also broke the SDK up into separate JARs (S3 SDK, core SDK, etc.).
1. Are the transitive dependencies in Hadoop branch-2 always the same? No; for Hadoop 2.9+ you need to declare a version of jackson2-cbor compatible with Spark's, otherwise the aws-sdk will pull in one which is incompatible with the version Spark declares at the top level (see the sketch just after this list).
1. Are the dependencies in Hadoop 3 going to stay the same? Absolutely not: we will be adding the DynamoDB SDK to the classpath for S3Guard, that consistent world view, and the O(1) committer.
1. What about Azure? Which version of the SDK is needed there? (2.0.0 for Hadoop 2.7.x-2.8.x; 4.2.2 for Hadoop 2.8+, where you also need to exclude the dependency on commons-lang3.)
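To make the fragility concrete, here is a minimal sketch of the manual alignment a downstream Maven build has to maintain today. All of the version numbers and property names are illustrative assumptions tied to a Hadoop 2.7.x build; the jackson-dataformat-cbor pin is the extra step a Hadoop 2.9+/SDK 1.11.x build would need.
```xml
<!-- Sketch only: versions are illustrative and must track the Hadoop line in use. -->
<properties>
  <hadoop.version>2.7.3</hadoop.version>
  <!-- the AWS SDK which hadoop-aws 2.7.x was built and tested against -->
  <aws.sdk.version>1.7.4</aws.sdk.version>
  <!-- keep aligned with the Jackson version Spark declares -->
  <jackson.version>2.6.5</jackson.version>
</properties>

<dependencyManagement>
  <dependencies>
    <!-- only needed on Hadoop 2.9+/SDK 1.11.x, whose SDK pulls in its own CBOR module -->
    <dependency>
      <groupId>com.fasterxml.jackson.dataformat</groupId>
      <artifactId>jackson-dataformat-cbor</artifactId>
      <version>${jackson.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>${aws.sdk.version}</version>
  </dependency>
</dependencies>
```
Every one of those numbers has to move in lockstep whenever the Hadoop version does.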
You see? It's an unstable transitive graph of things which absolutely need to be kept 100% in sync with the version of Hadoop which Spark was built against, and with the versions of other things (Jackson, httpclient) which Spark also pulls in; otherwise you end up with stack traces appearing on mailing lists, JIRAs and Stack Overflow.
The way to do that is not to have some text file somewhere saying "this is what you have to do"; it's to have a machine-readable file which does it, and ensures that things can be pulled in, shaded together as appropriate, and then delivered into people's hands. And the metadata can be published as another artifact into the repo, so if someone downstream wants a fully consistent set of artifacts, all they need to do is ask for it in their own machine-readable bit of code:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hadoop-cloud_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
</dependency>
```
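For illustration, here is a rough, hypothetical slice of what such a module's own POM could declare; the Hadoop artifact names are real, but the exact dependency set and exclusions are an assumption about one possible layout, not the actual content of this patch.
```xml
<!-- Hypothetical sketch of one slice of such a module's POM; not the actual patch. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
      <!-- drop the transitive Jackson artifacts so the versions Spark manages win -->
      <exclusion>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-azure</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
```
The point is that this alignment lives in one published, machine-readable place rather than in every downstream build.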
@srowen
Regarding S3A consistency, you'll be wanting [S3Guard](https://issues.apache.org/jira/browse/HADOOP-13345), whose developers include your colleague @ajfabbri. I am transitively testing Spark + S3Guard, using this module and the [downstream set of examples which I pulled out from this patch](https://github.com/steveloughran/spark-cloud-examples). That ensures that Spark becomes one of the things where we can say "works".
S3Guard will bring consistent listings to the API: if you create a file and then do a list, it'll be there. This is going to be a prerequisite for the zero-rename committer of [HADOOP-13786](https://issues.apache.org/jira/browse/HADOOP-13786). That's going to allow anything under `FileOutputFormat` to write directly to the destination dir, supporting both speculation and failures. People will want that. And they will only be able to use it if every dependency in their packaging is consistent.
Another way to look at it is this: a very large percentage of people using Spark are running it in in-cloud deployments. There is no reliable way for them to get their dependencies right with the ASF releases, which not only cripples those releases compared to commercial products bundling Spark themselves, but also makes it near-impossible for developers to work at the Maven level, building applications off the Maven dependency graph.