Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/12004
# Packaging:
1. It addresses the problem that it's not always immediately obvious to
people what they have to do to get, say, s3a working. Do you know precisely
which version of the AWS SDK you need to have on your CP for a specific
version of hadoop-aws.jar to avoid getting a linkage error? That's the problem
Maven handles for you.
1. With a new module, it lets downstream applications build with that
support, knowing that issues related to dependency versions have been handled
for them (see the dependency sketch after this list).
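To illustrate the dependency-management point, here is a minimal sbt sketch; the module name and versions are assumptions for illustration, not what this PR actually publishes:
```scala
// Hypothetical sbt fragment: artifact name and versions are illustrative only.
// Depending on a single cloud-support module lets the build tool resolve the
// hadoop-aws JAR and the AWS SDK version it was built against, instead of the
// user guessing a compatible pair by hand.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-cloud" % "2.0.0"   // hypothetical module name
)
```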
# Documentation
It has an overview of how to use this stuff, lists those dependencies,
explains whether the stores can be used as a direct destination for work, why the
Direct committer was taken away, etc.
# Testing
The tests make sure everything works: the packaging, the versioning
of Jackson, propagation of configuration options, failure handling, etc. This
offers:
1. Verifying the packaging. The initial role of the tests was to make sure
the classpaths were coming in right, the filesystems were registering, etc.
1. Compliance testing of the object stores' client libraries: have they
implemented the relevant APIs the way they are meant to, so that Spark can use
them to list, read and write data? (A registration/compliance sketch follows
this list.)
1. Regression testing of the Hadoop client libs: functionality and
performance. This module, along with some Hive stuff, is the basis for
benchmarking S3A performance improvements.
1. Regression testing of spark functionality/performance; highlighting
places to tune stuff like directory listing operations.
1. Regression testing of the cloud infrastructures themselves. More relevant with
OpenStack than the others, as that's the one where you can test against nightly
builds.
1. Cross-object-store benchmarking. Compare how long it takes the dataframe
example to complete on Azure vs. S3A, and crank up the debugging to see where
the delays are (it's the S3 copy being way, way slower; it looks like Azure is not
actually copying bytes). A rough sketch of such a comparison follows the list.
1. Integration testing. That is, rather than just run a minimal scalatest
operation, you can use spark-submit to push the work out to a full cluster, to
verify that the right JARs made it out, that the cluster isn't running incompatible
versions of the JVM and Joda-Time, etc. (also sketched below).
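As an illustration of the registration, compliance and configuration-propagation points, here is a minimal sketch, assuming a reachable test bucket whose URI arrives in a hypothetical `test.s3a.uri` system property; the class name and option values are illustrative, not the actual suites in this PR:
```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative smoke test, not the suite added in this PR.
object S3ASmokeTest {
  def main(args: Array[String]): Unit = {
    // Hypothetical property naming a test bucket, e.g. s3a://my-test-bucket/spark
    val testDir = new URI(sys.props("test.s3a.uri"))

    // spark.hadoop.* options propagate into the Hadoop Configuration used by
    // the filesystem clients -- the "propagation of configuration options"
    // that the tests need to cover.
    val sparkConf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("s3a-smoke")
      .set("spark.hadoop.fs.s3a.connection.maximum", "10")
    val sc = new SparkContext(sparkConf)
    try {
      val hadoopConf: Configuration = sc.hadoopConfiguration

      // Packaging check: if hadoop-aws and the AWS SDK are on the classpath
      // in compatible versions, this lookup succeeds; otherwise it fails with
      // a ClassNotFoundException or a linkage error.
      val fs = FileSystem.get(testDir, hadoopConf)

      // Compliance check: basic write, list and read through the store.
      val out = new Path(new Path(testDir), "smoke-data")
      sc.parallelize(Seq("hello", "object", "store")).saveAsTextFile(out.toString)
      assert(fs.listStatus(out).nonEmpty, s"no files listed under $out")
      assert(sc.textFile(out.toString).count() == 3)
    } finally {
      sc.stop()
    }
  }
}
```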
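And a rough sketch of the cross-store comparison: the same dataframe write timed against two placeholder destinations (the bucket and container names are made up), where the commit-time copy/rename difference shows up:
```scala
import org.apache.spark.sql.SparkSession

// Illustrative benchmark skeleton; the store URIs are placeholders.
object CrossStoreBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cross-store-benchmark")
      .getOrCreate()

    val stores = Seq(
      "s3a://my-test-bucket/bench",                           // placeholder S3A destination
      "wasb://[email protected]/bench"  // placeholder Azure destination
    )

    val df = spark.range(0, 1000000).selectExpr("id", "concat('row-', id) AS value")

    for (dest <- stores) {
      val start = System.nanoTime()
      df.write.mode("overwrite").parquet(dest)  // the copy/rename in the commit
                                                // phase is where S3A loses time
      val seconds = (System.nanoTime() - start) / 1e9
      println(f"$dest%-60s $seconds%.1f s")
    }
    spark.stop()
  }
}
```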
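Finally, for the integration-testing case, the submission to a real cluster could be driven programmatically with `SparkLauncher` instead of a shell script; everything here (jar path, master, class name) is a placeholder:
```scala
import org.apache.spark.launcher.SparkLauncher

// Illustrative only: drives spark-submit from code to run the smoke test on a cluster.
object SubmitSmokeTest {
  def main(args: Array[String]): Unit = {
    val process = new SparkLauncher()
      .setMaster("yarn")                            // a real cluster, not local[*]
      .setDeployMode("cluster")
      .setAppResource("/path/to/cloud-tests.jar")   // placeholder artifact
      .setMainClass("S3ASmokeTest")                 // the sketch above
      .setConf("spark.executor.memory", "1g")
      .launch()
    val exit = process.waitFor()
    // A non-zero exit hints that the JARs didn't make it out, versions clash, etc.
    assert(exit == 0, s"cluster smoke test failed with exit code $exit")
  }
}
```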
With this module, then, people get the option of building Spark with the
JARs on the CP. But they also gain the ability to have Jenkins set up to make
sure that everything works, all the time.
It also provides a placeholder for adding any code specific to object stores,
like, perhaps, some kind of committer. I don't have any plans there, but others
might.