Haohui & Chris,

Sounds great, thank you very much! We'll cut a JIRA once we get everything
lined up.
Best,

James

On Mon, Dec 7, 2015 at 3:54 PM, Chris Nauroth <cnaur...@hortonworks.com>
wrote:
> Hi James,
>
> This sounds great! Thank you for considering contributing the code.
>
> Just seconding what Haohui said, there is existing precedent for
> alternative implementations of the Hadoop FileSystem in our codebase. We
> currently have similar plugins for S3 [1], Azure [2], and OpenStack Swift
> [3]. Additionally, we have a suite of FileSystem contract tests [4].
> These tests are designed to help developers of alternative file systems
> assess how closely they match the semantics expected by Hadoop ecosystem
> components.
>
> Many Hadoop users are accustomed to using HDFS instead of these
> alternative file systems, so none of the alternatives are on the default
> Hadoop classpath immediately after deployment. Instead, the code for each
> one is in a separate module under the "hadoop-tools" directory in the
> source tree. Users who need the alternative file systems take extra steps
> post-deployment to add them to the classpath where necessary. This
> achieves the needed dependency isolation. For example, users who never
> use the Azure plugin won't accidentally pick up a transitive dependency
> on the Azure SDK jar.
>
> I recommend taking a quick glance through the existing modules for S3,
> Azure, and OpenStack. For consistency, we'll likely ask that a new
> FileSystem implementation follow the same patterns where feasible. This
> would include things like using the contract tests, having a provision to
> execute tests both offline/mocked and live/integrated against the real
> service, and providing a documentation page that explains configuration
> for end users.
>
> For now, please feel free to file a HADOOP JIRA with your proposal. We
> can work out the details of all of this in discussion on that JIRA.
>
> Something else to follow up on will be licensing concerns.
> I see the
> project already uses the Apache license, but it appears to be an existing
> body of code initially developed at Google. That might require a Software
> Grant Agreement [5]. Again, this is something that can be hashed out in
> discussion on the JIRA after you create it.
>
> [1] http://hadoop.apache.org/docs/r2.7.1/hadoop-aws/tools/hadoop-aws/index.html
> [2] http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html
> [3] http://hadoop.apache.org/docs/r2.7.1/hadoop-openstack/index.html
> [4] http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/filesystem/testing.html
> [5] http://www.apache.org/licenses/
>
> --Chris Nauroth
>
> On 12/7/15, 3:10 PM, "Haohui Mai" <ricet...@gmail.com> wrote:
>
> >Hi,
> >
> >Thanks for reaching out. It would be great to see this in the Hadoop
> >ecosystem.
> >
> >In Hadoop we have AWS S3 support. IMO they address similar use cases,
> >so I think it should be relatively straightforward to adopt the code.
> >
> >The only catch in my head right now is properly isolating dependencies.
> >Not only does the code need to go into a separate module, but many
> >Hadoop applications also depend on different versions of Guava. I
> >think it might be a problem that needs some attention at the very
> >beginning.
> >
> >Please feel free to reach out if you have any other questions.
> >
> >Regards,
> >Haohui
> >
> >On Mon, Dec 7, 2015 at 2:35 PM, James Malone
> ><jamesmal...@google.com.invalid> wrote:
> >> Hello,
> >>
> >> We're from a team within Google Cloud Platform focused on OSS and data
> >> technologies, especially Hadoop (and Spark). Before we cut a JIRA for
> >> something we'd like to do, we wanted to reach out to this list to ask
> >> two quick questions, describe our proposed action, and check for any
> >> major objections.
> >>
> >> Proposed action:
> >> We have a Hadoop connector [1] (more info [2]) for Google Cloud
> >> Storage (GCS) which we have been building and maintaining for some
> >> time. After we clean up our code and tests to conform (to these [3]
> >> and other requirements), we would like to contribute it to Hadoop. We
> >> have many customers using the connector in high-throughput production
> >> Hadoop clusters; we'd like to make it easier and faster to use Hadoop
> >> and GCS.
> >>
> >> Timeline:
> >> Presently, we are working on the beta of Google Cloud Dataproc [4],
> >> which limits our time a bit, so we're targeting late Q1 2016 for
> >> creating a JIRA issue and adapting our connector code as needed.
> >>
> >> Our (quick) questions:
> >> * Do we need to take any (non-coding) action for this beyond
> >>   submitting a JIRA when we are ready?
> >> * Are there any up-front concerns or questions which we can (or will
> >>   need to) address?
> >>
> >> Thank you!
> >>
> >> James Malone
> >> On behalf of the Google Big Data OSS Engineering Team
> >>
> >> Links:
> >> [1] - https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
> >> [2] - https://cloud.google.com/hadoop/google-cloud-storage-connector
> >> [3] - https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
> >> [4] - https://cloud.google.com/dataproc
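The post-deployment steps Chris describes for alternative file systems can be sketched roughly as below. This is a minimal illustration, not the connector's documented setup: the jar path is an assumption, and the `fs.gs.impl` property name and implementation class shown in the comments are taken from the GCS connector's own configuration convention but should be verified against its docs.

```shell
# Hypothetical sketch: enabling an out-of-tree FileSystem after deployment.
# The jar location below is an illustrative assumption.

# 1. Add the connector jar to the Hadoop classpath for clients and daemons.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:/opt/connectors/gcs-connector.jar"

# 2. Map the URI scheme to the implementation class in core-site.xml,
#    e.g. with a property entry along the lines of:
#      <property>
#        <name>fs.gs.impl</name>
#        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
#      </property>
#    After that, paths like gs://bucket/path resolve through the connector.
echo "HADOOP_CLASSPATH=${HADOOP_CLASSPATH}"
```

Because none of the alternative file systems ship on the default classpath, clusters that never configure this step carry no transitive dependencies from the connector, which is the isolation property Chris mentions.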