You can use all the cloud stores as the destination of work; just get the connectors on the classpath and use their URIs: s3a://, gs://, wasb://.
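For example, a job writing SequenceFile output to an object store looks the same as one writing to HDFS; only the output URI changes. A rough sketch, where the bucket names, paths and the LongWritable/Text key/value types are placeholders and the identity Mapper stands in for whatever your job actually does:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CloudStoreOutputJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "write-to-object-store");
        job.setJarByClass(CloudStoreOutputJob.class);

        // Identity mapper, no reducer: records are copied through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The source can stay on HDFS; the destination is just a different
        // filesystem URI, e.g. s3a://my-bucket/out, gs://my-bucket/out or
        // wasb://container@account.blob.core.windows.net/out
        SequenceFileInputFormat.addInputPath(job, new Path("hdfs:///data/in"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("s3a://my-bucket/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }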
To use S3 as a destination with performance and consistency, you need the
S3A committers (see the hadoop-aws docs) and, to safely chain work, the
S3Guard consistency tool. GCS and Azure storage can be used as destinations
as they are.

See: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/bk_cloud-data-access/content/about.html

If you can use a store as a distcp source or destination, you can use it
with MR. (A minimal configuration sketch for the S3A committer is at the
end of this mail.)

On Thu, Jun 13, 2019 at 8:09 AM Amit Kabra <amitkabrai...@gmail.com> wrote:

> Hello,
>
> I have a requirement where I need to read/write data to a public cloud
> via a MapReduce job.
>
> Our systems currently read and write data from HDFS using MapReduce, and
> it's working well; we write data in SequenceFile format.
>
> We might have to move data to a public cloud, i.e. S3 / GCP, where
> everything remains the same, just we do read/write to S3/GCP.
>
> I did a quick search for GCP and I didn't get much info on doing
> MapReduce directly against it. The GCS connector for Hadoop
> <https://cloudplatform.googleblog.com/2014/01/performance-advantages-of-the-new-google-cloud-storage-connector-for-hadoop.html>
> looks closest, but I didn't find any MapReduce sample for it.
>
> Any help on where to start, or is it not even possible, say an S3/GCP
> output format
> <https://github.com/apache/hadoop/tree/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output>
> is not there, etc., and we need to do some hack?
>
> Thanks,
> Amit Kabra.
>
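For reference, wiring in the S3A committer mentioned above is configuration
only; roughly something like the snippet below, with property names taken
from the hadoop-aws committer and S3Guard docs and example values (the
DynamoDB table and credentials still need to be set up per those docs):

    // Route MR output commits on s3a:// paths through the S3A committer
    // factory, pick a committer, and (optionally) enable S3Guard for
    // consistent listings between chained jobs.
    Configuration conf = new Configuration();
    conf.set("mapreduce.outputcommitter.factory.scheme.s3a",
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
    conf.set("fs.s3a.committer.name", "directory");  // or "partitioned" / "magic"
    conf.set("fs.s3a.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");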