Hi, Steve.

Here is the PR for publishing it as part of Apache Spark 3.2.0+

https://github.com/apache/spark/pull/33003

Bests,
Dongjoon.

On 2021/06/01 17:09:53, Steve Loughran <ste...@cloudera.com.INVALID> wrote: 
> (can't reply to user@, so pulling @dev instead. sorry)
> 
> There is no fundamental reason why the hadoop-cloud POM and artifact aren't
> built/released by the ASF Spark project; I think the effort it took to get
> the spark-hadoop-cloud module in at all was enough to put me off trying
> to get the artifact released.
> 
> Including the AWS SDK in the Spark tarball is the main thing to question.
> 
> The module does contain some minimal binding classes to deal with two issues,
> both of which are actually fixable if anyone sat down to do it.
> 
> 
>    1. Spark using the MapReduce v1 APIs (org.apache.hadoop.mapred) vs v2
>    (org.apache.hadoop.mapreduce.{input, output, ...}). That's fixable in
>    Spark; a shim class was just a lot less traumatic.
>    2. Parquet being fussy about writing to a subclass of
>    ParquetOutputCommitter. Again, a shim does that; the alternative is a fix in
>    Parquet. Or I modify the original Hadoop FileOutputCommitter to actually
>    wrap/forward to a new committer. I chose not to do that from the outset
>    because that class scares me. Nothing has changed my opinion there. FWIW,
>    EMR just did their S3-only committer as a subclass of
>    ParquetOutputCommitter. A simpler solution if you don't have to care about
>    other committers for other stores.
> 
> Move Spark to the MRv2 APIs and get the Parquet library to downgrade gracefully
> if the committer isn't a subclass (it wants the option to call
> writeMetaDataFile()), and the need for those shims goes away.
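> 
> For reference, wiring those shims up today is just a couple of options; the
> class and option names below are the ones from the cloud-integration docs,
> so treat this as a rough sketch rather than gospel:
> 
>     import org.apache.spark.sql.SparkSession
> 
>     // Route Spark SQL output through the cloud committer bindings and
>     // pick an S3A committer ("directory" here) on the Hadoop side.
>     val spark = SparkSession.builder()
>       .appName("s3a-committer-sketch")
>       .config("spark.sql.sources.commitProtocolClass",
>         "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
>       .config("spark.sql.parquet.output.committer.class",
>         "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
>       .config("spark.hadoop.fs.s3a.committer.name", "directory")
>       .getOrCreate()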
> 
> What the module also does is import the relevant hadoop-aws, hadoop-azure
> modules etc. and strip out anything which complicates life. Once published
> to the Maven repo, apps can import it downstream and get a consistent
> set of hadoop-* artifacts, plus the AWS artifacts which they've been
> compiled and tested with.
> 
> They are published by both Cloudera and Palantir; it'd be really good for
> the world as a whole if the ASF published them too, in sync with the rest
> of the release.
> 
> https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud
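> 
> For a downstream build that would then be a single dependency; e.g. in sbt,
> where the version below is just a placeholder for whichever Spark release
> actually publishes the module:
> 
>     // Sketch: depend on the published spark-hadoop-cloud artifact.
>     val sparkVersion = "3.x.y"  // placeholder
>     libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion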
> 
> 
> There's one other aspect of the module: when it is built, the Spark
> distribution includes the AWS SDK bundle, which is a few hundred MB and
> growing.
> 
> Why use the whole shaded JAR? Classpaths. Jackson versions, httpclient
> versions, etc.: if they weren't shaded it'd be very hard to get a consistent
> set of dependencies. There's the side benefit of having one consistent set
> of AWS libraries, so spark-kinesis will be in sync with the s3a client,
> the DynamoDB client, etc. (
> https://issues.apache.org/jira/browse/HADOOP-17197 )
> 
> There's a very good case for excluding that SDK from the distro unless you
> are confident people really want it. Instead just say "this release
> contains all the ASF dependencies needed to work with AWS; just add
> aws-sdk-bundle 1.11.XYZ".
> 
> I'm happy to work on that if I can get some promise of review time from
> others.
> 
> On related notes
> 
> Hadoop 3.3.1 RCs are up for testing. For S3A this includes everything in
> https://issues.apache.org/jira/browse/HADOOP-16829 : big speedups in list
> calls, and you can turn off deletion of directory markers for significant IO
> gains/reduced throttling. Do play ASAP, and do complain about issues: this is
> your last chance before things ship.
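> 
> From Spark, that marker switch is just a Hadoop option passed through the
> spark.hadoop. prefix; a minimal sketch (only flip it once every client
> writing to the bucket is 3.3.1-aware):
> 
>     import org.apache.spark.sql.SparkSession
> 
>     // Sketch: ask S3A to keep directory markers instead of deleting them.
>     val spark = SparkSession.builder()
>       .config("spark.hadoop.fs.s3a.directory.marker.retention", "keep")
>       .getOrCreate()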
> 
> For everything else, yes, many benefits. And, courtesy of Huawei, native
> ARM support too. Your VM cost/hour just went down for all workloads where
> you don't need GPUs.
> 
> *The RC2 artifacts are at*:
> https://home.apache.org/~weichiu/hadoop-3.3.1-RC2/
> ARM artifacts: https://home.apache.org/~weichiu/hadoop-3.3.1-RC2-arm/
> 
> 
> *The maven artifacts are hosted here:*
> https://repository.apache.org/content/repositories/orgapachehadoop-1318/
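> 
> If you want to test-compile against the RC before it reaches Maven Central,
> just point your build at that staging repo; in sbt, something like:
> 
>     // Sketch: add the Hadoop 3.3.1 RC2 staging repository as a resolver.
>     resolvers += "ASF Hadoop 3.3.1 RC2 staging" at
>       "https://repository.apache.org/content/repositories/orgapachehadoop-1318/"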
> 
> 
> Independent of that, anyone working on Azure or GCS who wants Spark to
> write output in a classic Hive partitioned directory structure: there's a
> WiP committer which promises speed and correctness even when the store
> (GCS) doesn't do atomic directory renames.
> 
> https://github.com/apache/hadoop/pull/2971
> 
> Reviews and testing with private datasets are strongly encouraged, and I'd
> love to get the IOStatistics parts of the _SUCCESS files to see what happened.
> This committer measures the time to list/rename/mkdir in task and job commit,
> and aggregates it all into the final report.
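> 
> If you do run it, pulling that report out is easy: _SUCCESS should just be a
> small JSON summary in the destination directory, so something along these
> lines (the path is obviously a made-up example) will dump it:
> 
>     import org.apache.hadoop.fs.Path
>     import org.apache.hadoop.io.IOUtils
> 
>     // Sketch: print the JSON _SUCCESS manifest (IOStatistics included)
>     // so it can be attached to a review or JIRA. Assumes an active
>     // SparkSession named `spark`.
>     val success = new Path("gs://some-bucket/output/_SUCCESS")
>     val fs = success.getFileSystem(spark.sparkContext.hadoopConfiguration)
>     val in = fs.open(success)
>     try IOUtils.copyBytes(in, System.out, 4096, false) finally in.close()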
> 
> -Steve
> 
> On Mon, 31 May 2021 at 13:35, Sean Owen <sro...@gmail.com> wrote:
> 
> > I know it's not enabled by default when the binary artifacts are built,
> > but not exactly sure why it's not built separately at all. It's almost a
> > dependencies-only pom artifact, but there are two source files. Steve do
> > you have an angle on that?
> >
> > On Mon, May 31, 2021 at 5:37 AM Erik Torres <etserr...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I'm following this documentation
> >> <https://spark.apache.org/docs/latest/cloud-integration.html#installation> 
> >> to
> >> configure my Spark-based application to interact with Amazon S3. However, I
> >> cannot find the spark-hadoop-cloud module in Maven central for the
> >> non-commercial distribution of Apache Spark. From the documentation I would
> >> expect that I can get this module as a Maven dependency in my project.
> >> However, I ended up building the spark-hadoop-cloud module from Spark's
> >> code <https://github.com/apache/spark>.
> >>
> >> Is this the expected way to set up the integration with Amazon S3? I think
> >> I'm missing something here.
> >>
> >
> 
