You can use all the cloud stores as the destination of work; just get the connectors on the classpath and use their URIs: s3a://, gs://, wasb://.
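For example, a job writing SequenceFile output to an object store looks the same as one writing to HDFS; only the output URI changes. A rough sketch, where the bucket names, paths and the LongWritable/Text key/value types are placeholders and the identity Mapper stands in for whatever your job actually does:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CloudStoreOutputJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "write-to-object-store");
        job.setJarByClass(CloudStoreOutputJob.class);

        // Identity mapper, no reducer: records are copied through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The source can stay on HDFS; the destination is just a different
        // filesystem URI, e.g. s3a://my-bucket/out, gs://my-bucket/out or
        // wasb://container@account.blob.core.windows.net/out
        SequenceFileInputFormat.addInputPath(job, new Path("hdfs:///data/in"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("s3a://my-bucket/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }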
To use S3 as a destination with performance and consistency, you need the
S3A committers (see the hadoop-aws docs) and, to safely chain work, the
S3Guard consistency tool. GCS and Azure storage can be used as destinations
as they are.

See: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/bk_cloud-data-access/content/about.html

If you can use a store as a distcp source or destination, you can use it
with MR. (A minimal configuration sketch for the S3A committer is at the
end of this mail.)

On Thu, Jun 13, 2019 at 8:09 AM Amit Kabra <amitkabrai...@gmail.com> wrote:

> Hello,
>
> I have a requirement where I need to read/write data to a public cloud
> via a MapReduce job.
>
> Our systems currently read and write data from HDFS using MapReduce, and
> it's working well; we write data in SequenceFile format.
>
> We might have to move data to a public cloud, i.e. S3 / GCP, where
> everything remains the same, just we do read/write to S3/GCP.
>
> I did a quick search for GCP and I didn't get much info on doing
> MapReduce directly against it. The GCS connector for Hadoop
> <https://cloudplatform.googleblog.com/2014/01/performance-advantages-of-the-new-google-cloud-storage-connector-for-hadoop.html>
> looks closest, but I didn't find any MapReduce sample for it.
>
> Any help on where to start, or is it not even possible, say an S3/GCP
> output format
> <https://github.com/apache/hadoop/tree/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output>
> is not there, etc., and we need to do some hack?
>
> Thanks,
> Amit Kabra.
>
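For reference, wiring in the S3A committer mentioned above is configuration
only; roughly something like the snippet below, with property names taken
from the hadoop-aws committer and S3Guard docs and example values (the
DynamoDB table and credentials still need to be set up per those docs):

    // Route MR output commits on s3a:// paths through the S3A committer
    // factory, pick a committer, and (optionally) enable S3Guard for
    // consistent listings between chained jobs.
    Configuration conf = new Configuration();
    conf.set("mapreduce.outputcommitter.factory.scheme.s3a",
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
    conf.set("fs.s3a.committer.name", "directory");  // or "partitioned" / "magic"
    conf.set("fs.s3a.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");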