(can't reply to user@, so pulling in @dev instead. sorry)
There is no fundamental reason why the hadoop-cloud POM and artifact isn't built/released by the ASF Spark project; I think the effort it took to get the spark-hadoop-cloud module in at all was enough to put me off trying to get the artifact released. Including the AWS SDK in the Spark tarball is the main thing to question.

The module does contain some minimal binding classes to deal with two issues, both of which are actually fixable if anyone sat down to do it (there's a config sketch further down showing how those bindings get wired up):

1. Spark using the MapReduce v1 APIs (org.apache.hadoop.mapred) vs v2 (org.apache.hadoop.mapreduce.{input, output, ...}). That's fixable in Spark; a shim class was just a lot less traumatic.

2. Parquet being fussy about writing through a subclass of ParquetOutputCommitter. Again, a shim does that; the alternative is a fix in Parquet. Or I modify the original Hadoop FileOutputCommitter to actually wrap/forward to a new committer. I chose not to do that from the outset because that class scares me. Nothing has changed my opinion there. FWIW, EMR just did their S3-only committer as a subclass of ParquetOutputCommitter. That's a simpler solution if you don't have to care about other committers for other stores.

Move Spark to the MRv2 APIs and get the Parquet library to downgrade gracefully if the committer isn't a subclass (it wants the option to call writeMetaDataFile()), and the need for those shims goes away.

What the module also does is import the relevant hadoop-aws, hadoop-azure modules etc. and strip out anything which complicates life. Once published to the Maven repo, apps can import it downstream and get a consistent set of hadoop-* artifacts, plus the AWS artifacts they've been compiled and tested with. Cloudera and Palantir both publish the artifact; it'd be really good for the world as a whole if the ASF published it too, in sync with the rest of the release:
https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud

There's one other aspect of the module: when it is built, the Spark distribution includes the AWS SDK bundle, which is a few hundred MB and growing. Why use the whole shaded JAR? Classpaths. Jackson versions, httpclient versions, etc.: if they weren't shaded it'd be very hard to get a consistent set of dependencies. There's the side benefit of having one consistent set of AWS libraries, so spark-kinesis will be in sync with the s3a client, the DynamoDB client, etc.
(https://issues.apache.org/jira/browse/HADOOP-17197)

There's a very good case for excluding that SDK from the distro unless you are confident people really want it. Instead just say "this release contains all the ASF dependencies needed to work with AWS; just add aws-sdk-bundle 1.11.XYZ" (there's a build snippet sketching that further down). I'm happy to work on that if I can get some promise of review time from others.

On related notes: Hadoop 3.3.1 RCs are up for testing. For S3A this includes everything in https://issues.apache.org/jira/browse/HADOOP-16829 - big speedups in list calls, and you can turn off deletion of directory markers for significant IO gains/reduced throttling. Do play ASAP, do complain on issues: this is your last chance before things ship. For everything else, yes, many benefits. And, courtesy of Huawei, native ARM support too. Your VM cost/hour just went down for all workloads where you don't need GPUs.
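Back on the module itself, to make the binding-class discussion a bit more concrete: once spark-hadoop-cloud (however you obtained it) and hadoop-aws are on the classpath, pointing Spark at the S3A committers is just configuration. Here's a minimal sketch; the class and option names are the ones in the Spark cloud-integration and Hadoop S3A committer docs, but do check them against the release you actually run, and the bucket path is made up:

import org.apache.spark.sql.SparkSession

object S3ACommitterDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3a-committer-demo")
      // route Spark's file commit protocol through the cloud binding class
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      // the Parquet shim: a ParquetOutputCommitter subclass which delegates
      // to whatever committer the filesystem's committer factory supplies
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      // pick one of the S3A committers: "directory", "partitioned" or "magic"
      .config("spark.hadoop.fs.s3a.committer.name", "directory")
      .getOrCreate()

    // this write now commits through the S3A committer rather than
    // directory rename (s3a://example-bucket/ is a made-up path)
    spark.range(1000).write.mode("overwrite").parquet("s3a://example-bucket/demo/")

    spark.stop()
  }
}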
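And on Erik's original question about Maven Central: if/when the ASF publishes the artifact, consuming it downstream ought to be nothing more than something like this (sbt syntax; the version strings are placeholders, not recommendations - match sparkVersion to your Spark release and the SDK version to whatever your hadoop-aws line was built and tested with):

// build.sbt - sketch only; version strings are placeholders
val sparkVersion  = "3.x.y"      // your Spark release
val awsSdkVersion = "1.11.XYZ"   // the SDK your hadoop-aws was tested with

libraryDependencies ++= Seq(
  // brings in a consistent set of hadoop-* cloud connector artifacts
  "org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion,
  // the big shaded AWS SDK, kept out of the Spark distribution itself
  "com.amazonaws" % "aws-java-sdk-bundle" % awsSdkVersion
)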
*The Hadoop 3.3.1 RC2 artifacts are at*:
https://home.apache.org/~weichiu/hadoop-3.3.1-RC2/
ARM artifacts: https://home.apache.org/~weichiu/hadoop-3.3.1-RC2-arm/

*The maven artifacts are hosted here:*
https://repository.apache.org/content/repositories/orgapachehadoop-1318/

Independent of that: anyone working on Azure or GCS who wants Spark to write output in a classic Hive partitioned directory structure - there's a WiP committer which promises speed and correctness even when the store (GCS) doesn't do atomic dir renames:
https://github.com/apache/hadoop/pull/2971

Reviews and testing with private datasets are strongly encouraged, and I'd love to get the IOStatistics parts of the _SUCCESS files to see what happened. This committer measures time to list/rename/mkdir in task and job commit, and aggregates them all into the final report.

-Steve

On Mon, 31 May 2021 at 13:35, Sean Owen <sro...@gmail.com> wrote:

> I know it's not enabled by default when the binary artifacts are built,
> but not exactly sure why it's not built separately at all. It's almost a
> dependencies-only pom artifact, but there are two source files. Steve, do
> you have an angle on that?
>
> On Mon, May 31, 2021 at 5:37 AM Erik Torres <etserr...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm following this documentation
>> <https://spark.apache.org/docs/latest/cloud-integration.html#installation>
>> to configure my Spark-based application to interact with Amazon S3.
>> However, I cannot find the spark-hadoop-cloud module in Maven central for
>> the non-commercial distribution of Apache Spark. From the documentation I
>> would expect that I can get this module as a Maven dependency in my
>> project. However, I ended up building the spark-hadoop-cloud module from
>> Spark's code <https://github.com/apache/spark>.
>>
>> Is this the expected way to set up the integration with Amazon S3? I
>> think I'm missing something here.
>>