Hi, Steve.

Here is the PR for publishing it as a part of Apache Spark 3.2.0+:
https://github.com/apache/spark/pull/33003
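Once it is published, pulling the artifact in should be a one-liner. A rough
sbt sketch, assuming it does ship with the 3.2.0 release; the SDK bundle
version is the usual placeholder from this thread and may also come in
transitively via hadoop-aws:

  // build.sbt (sketch)
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-hadoop-cloud"  % "3.2.0",
    // placeholder: match the aws-java-sdk-bundle version hadoop-aws was built with
    "com.amazonaws"     % "aws-java-sdk-bundle" % "1.11.XYZ"
  )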
Bests,
Dongjoon.

On 2021/06/01 17:09:53, Steve Loughran <ste...@cloudera.com.INVALID> wrote:
> (can't reply to user@, so pulling @dev instead. sorry)
>
> There is no fundamental reason why the hadoop-cloud POM and artifact isn't
> built/released by the ASF spark project; I think the effort it took to get
> the spark-hadoop-cloud module in at all was enough to put me off trying to
> get the artifact released.
>
> Including the AWS SDK in the spark tarball is the main thing to question.
>
> It does contain some minimal binding classes to deal with two issues, both
> of which are actually fixable if anyone sat down to do it.
>
> 1. Spark using the mapreduce V1 APIs (org.apache.hadoop.mapred) vs v2
>    (org.apache.hadoop.mapreduce.{input, output, ...}). That's fixable in
>    Spark; a shim class was just a lot less traumatic.
> 2. Parquet being fussy about writing to a subclass of ParquetOutputCommitter.
>    Again, a shim does that; the alternative is a fix in Parquet. Or I modify
>    the original Hadoop FileOutputCommitter to actually wrap/forward to a new
>    committer. I chose not to do that from the outset because that class
>    scares me. Nothing has changed my opinion there. FWIW, EMR just did their
>    S3-only committer as a subclass of ParquetOutputCommitter. Simpler
>    solution if you don't have to care about other committers for other
>    stores.
>
> Move Spark to the MRv2 APIs and have the Parquet lib downgrade gracefully if
> the committer isn't a subclass (it wants the option to call
> writeMetaDataFile()), and the need for those shims goes away.
>
> What the module also does is import the relevant hadoop-aws, hadoop-azure
> modules etc. and strip out anything which complicates life. When published
> to the maven repo, apps can then import it downstream and get a consistent
> set of hadoop-* artifacts, and the AWS artifacts which they've been compiled
> and tested with.
>
> They are published by both Cloudera and Palantir; it'd be really good for
> the world as a whole if the ASF published them too, in sync with the rest of
> the release.
>
> https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud
>
> There's one other aspect of the module, which is that when it is built the
> spark distribution includes the AWS SDK bundle, which is a few hundred MB
> and growing.
>
> Why use the whole shaded JAR? Classpaths. Jackson versions, httpclient
> versions, etc.: if they weren't shaded it'd be very hard to get a consistent
> set of dependencies. There's the side benefit of having one consistent set
> of AWS libraries, so spark-kinesis will be in sync with the s3a client, the
> DynamoDB client, etc. (https://issues.apache.org/jira/browse/HADOOP-17197)
>
> There's a very good case for excluding that SDK from the distro unless you
> are confident people really want it. Instead just say "this release contains
> all the ASF dependencies needed to work with AWS; just add aws-sdk-bundle
> 1.11.XYZ".
>
> I'm happy to work on that if I can get some promise of review time from
> others.
>
> On related notes:
>
> Hadoop 3.3.1 RCs are up for testing. For S3A this includes everything in
> https://issues.apache.org/jira/browse/HADOOP-16829: big speedups in list
> calls, and you can turn off deletion of directory markers for significant IO
> gains/reduced throttling. Do play ASAP, do complain on issues: this is your
> last chance before things ship.
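> Keeping the markers is just a config switch once you're on the 3.3.1 client.
> A minimal sketch from the Spark side (the app name is arbitrary and the rest
> of the session config is elided; "keep" should only be used once everything
> reading the bucket is on a marker-aware Hadoop release):
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder()
>     .appName("marker-retention-demo")
>     // retain directory markers instead of deleting them on file creation
>     .config("spark.hadoop.fs.s3a.directory.marker.retention", "keep")
>     .getOrCreate()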
> For everything else, yes, many benefits. And, courtesy of Huawei, native ARM
> support too. Your VM cost/hour just went down for all workloads where you
> don't need GPUs.
>
> *The RC2 artifacts are at*:
> https://home.apache.org/~weichiu/hadoop-3.3.1-RC2/
> ARM artifacts: https://home.apache.org/~weichiu/hadoop-3.3.1-RC2-arm/
>
> *The maven artifacts are hosted here:*
> https://repository.apache.org/content/repositories/orgapachehadoop-1318/
>
> Independent of that, for anyone working on Azure or GCS who wants Spark to
> write output in a classic Hive partitioned directory structure, there's a
> WiP committer which promises speed and correctness even when the store (GCS)
> doesn't do atomic dir renames.
>
> https://github.com/apache/hadoop/pull/2971
>
> Reviews and testing with private datasets are strongly encouraged, and I'd
> love to get the IOStatistics parts of the _SUCCESS files to see what
> happened. This committer measures time to list/rename/mkdir in task and job
> commit, and aggregates them all into the final report.
>
> -Steve
>
> On Mon, 31 May 2021 at 13:35, Sean Owen <sro...@gmail.com> wrote:
>
> > I know it's not enabled by default when the binary artifacts are built,
> > but I'm not exactly sure why it's not built separately at all. It's almost
> > a dependencies-only pom artifact, but there are two source files. Steve,
> > do you have an angle on that?
> >
> > On Mon, May 31, 2021 at 5:37 AM Erik Torres <etserr...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I'm following this documentation
> >> <https://spark.apache.org/docs/latest/cloud-integration.html#installation>
> >> to configure my Spark-based application to interact with Amazon S3.
> >> However, I cannot find the spark-hadoop-cloud module in Maven central for
> >> the non-commercial distribution of Apache Spark. From the documentation I
> >> would expect that I can get this module as a Maven dependency in my
> >> project. However, I ended up building the spark-hadoop-cloud module from
> >> the Spark code <https://github.com/apache/spark>.
> >>
> >> Is this the expected way to set up the integration with Amazon S3? I
> >> think I'm missing something here.
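> >> For reference, a minimal sketch of the kind of setup the linked docs
> >> describe, once the module (however built) is on the classpath; the bucket
> >> name is a placeholder and the committer choice is just an example:
> >>
> >>   import org.apache.spark.sql.SparkSession
> >>
> >>   val spark = SparkSession.builder()
> >>     .appName("s3a-app")
> >>     // committer bindings shipped in spark-hadoop-cloud, per the docs
> >>     .config("spark.hadoop.fs.s3a.committer.name", "directory")
> >>     .config("spark.sql.sources.commitProtocolClass",
> >>       "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
> >>     .config("spark.sql.parquet.output.committer.class",
> >>       "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
> >>     .getOrCreate()
> >>
> >>   spark.range(10).write.parquet("s3a://my-bucket/tmp/demo")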