Re: Google Cloud Storage connector into Hadoop

Chris Nauroth Mon, 07 Dec 2015 15:59:53 -0800

Hi James,

This sounds great!  Thank you for considering contributing the code.

Just seconding what Haohui said, there is existing precedent for
alternative implementations of the Hadoop FileSystem in our codebase.  We
currently have similar plugins for S3 [1], Azure [2] and OpenStack Swift
[3].  Additionally, we have a suite of FileSystem contract tests [4].
These tests are designed to help developers of alternative file systems
assess how closely they match the semantics expected by Hadoop ecosystem
components.
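
To make the idea concrete, here is a toy sketch of what a contract-style
check looks like.  It is not the real contract test suite, which lives in
hadoop-common and covers far more.  The names TinyFs, InMemoryFs and
satisfiesContract are hypothetical and exist only for this illustration;
the in-memory class stands in for an offline/mocked store.

```java
import java.util.HashMap;
import java.util.Map;

public class ContractSketch {
    /** Hypothetical, heavily reduced stand-in for a file system API. */
    public interface TinyFs {
        void create(String path, byte[] data, boolean overwrite);
        boolean exists(String path);
        boolean delete(String path);
    }

    /** In-memory implementation, playing the role of an offline/mocked store. */
    public static class InMemoryFs implements TinyFs {
        private final Map<String, byte[]> files = new HashMap<>();

        @Override
        public void create(String path, byte[] data, boolean overwrite) {
            if (!overwrite && files.containsKey(path)) {
                throw new IllegalStateException("already exists: " + path);
            }
            files.put(path, data.clone());
        }

        @Override
        public boolean exists(String path) {
            return files.containsKey(path);
        }

        @Override
        public boolean delete(String path) {
            return files.remove(path) != null;
        }
    }

    /** Returns true if the given implementation meets the expectations below. */
    public static boolean satisfiesContract(TinyFs fs) {
        String p = "/contract/test-file";
        fs.delete(p); // start from a clean state
        fs.create(p, new byte[] {1, 2, 3}, false);
        if (!fs.exists(p)) {
            return false; // a created file must be visible
        }
        try {
            fs.create(p, new byte[] {4}, false);
            return false; // create without overwrite must fail on an existing file
        } catch (IllegalStateException expected) {
            // this is the behavior the check wants to see
        }
        if (!fs.delete(p)) {
            return false; // deleting an existing file must report success
        }
        // the file must be gone, and deleting it again must report false
        return !fs.exists(p) && !fs.delete(p);
    }

    public static void main(String[] args) {
        System.out.println(satisfiesContract(new InMemoryFs()));
    }
}
```

The same check can be pointed at any other TinyFs implementation, which is
the property that lets one suite run against both a mocked store and a
live service.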

Many Hadoop users are accustomed to using HDFS instead of these
alternative file systems, so none of the alternatives are on the default
Hadoop classpath immediately after deployment.  Instead, the code for each
one is in a separate module under the "hadoop-tools" directory in the
source tree.  Users who need to use the alternative file systems take
extra steps post-deployment to add them to the classpath where necessary.
This achieves the dependency isolation needed.  For example, users who
never use the Azure plugin won't accidentally pick up a transitive
dependency on the Azure SDK jar.
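
As a rough illustration of why this isolates dependencies, the sketch
below mimics a lookup from a URI scheme to an implementation class via a
property named fs.<scheme>.impl, a naming convention used in Hadoop
configuration.  It uses plain java.util.Properties and reflection and is
not Hadoop's actual loading code.  SchemeLookupSketch, the "demo" scheme
and example.hypothetical.GcsFileSystem are made-up names for this example.

```java
import java.util.Optional;
import java.util.Properties;

public class SchemeLookupSketch {
    /**
     * Looks up "fs.<scheme>.impl" in conf and tries to load the named class.
     * Returns empty when nothing is configured for the scheme or when the
     * configured class is not on the classpath.
     */
    public static Optional<Class<?>> resolve(Properties conf, String scheme) {
        String className = conf.getProperty("fs." + scheme + ".impl");
        if (className == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Class.forName(className));
        } catch (ClassNotFoundException e) {
            // The module that provides this class was never added to the classpath.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Stand-in class that is always present in the JDK.
        conf.setProperty("fs.demo.impl", "java.util.ArrayList");
        // Hypothetical class name that is not on the classpath.
        conf.setProperty("fs.gs.impl", "example.hypothetical.GcsFileSystem");

        System.out.println(resolve(conf, "demo").isPresent()); // true
        System.out.println(resolve(conf, "gs").isPresent());   // false
    }
}
```

Nothing from the optional module is loaded until its jar is on the
classpath, so its transitive dependencies stay out of deployments that
never opt in.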

I recommend taking a quick glance through the existing modules for S3,
Azure and OpenStack.  We'll likely ask that a new FileSystem
implementation follow the same patterns if feasible for consistency.  This
would include things like using the contract tests, having a provision to
execute tests both offline/mocked and live/integrated with the real
service and providing a documentation page that explains configuration for
end users.

For now, please feel free to file a HADOOP JIRA with your proposal.  We
can work out the details of all of this in discussion on that JIRA.

Something else to follow up on will be licensing concerns.  I see the
project already uses the Apache license, but it appears to be an existing
body of code initially developed at Google.  That might require a Software
Grant Agreement [5].  Again, this is something that can be hashed out in
discussion on the JIRA after you create it.

[1] http://hadoop.apache.org/docs/r2.7.1/hadoop-aws/tools/hadoop-aws/index.html
[2] http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html
[3] http://hadoop.apache.org/docs/r2.7.1/hadoop-openstack/index.html
[4] http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/filesystem/testing.html
[5] http://www.apache.org/licenses/

--Chris Nauroth

On 12/7/15, 3:10 PM, "Haohui Mai" <[email protected]> wrote:

>Hi,
>
>Thanks for reaching out. It would be great to see this in the Hadoop
>ecosystem.
>
>In Hadoop we have AWS S3 support. IMO they address similar use cases
>thus I think that it should be relatively straightforward to adopt the
>code.
>
>The only catch in my head right now is properly isolating dependencies.
>Not only does the code need to be put into a separate module, but many
>Hadoop applications also depend on different versions of Guava. I
>think it might be a problem that needs some attention at the very
>beginning.
>
>Please feel free to reach out if you have any other questions.
>
>Regards,
>Haohui
>
>
>On Mon, Dec 7, 2015 at 2:35 PM, James Malone
><[email protected]> wrote:
>> Hello,
>>
>> We're from a team within Google Cloud Platform focused on OSS and data
>> technologies, especially Hadoop (and Spark.) Before we cut a JIRA for
>> something we'd like to do, we wanted to reach out to this list to ask
>> two quick questions, describe our proposed action, and check for any
>> major objections.
>>
>> Proposed action:
>> We have a Hadoop connector[1] (more info[2]) for Google Cloud Storage
>> (GCS) which we have been building and maintaining for some time. After
>> we clean up our code and tests to conform (to these[3] and other
>> requirements) we would like to contribute it to Hadoop. We have many
>> customers using the connector in high-throughput production Hadoop
>> clusters; we'd like to make it easier and faster to use Hadoop and GCS.
>>
>> Timeline:
>> Presently, we are working on the beta of Google Cloud Dataproc[4] which
>> limits our time a bit, so we're targeting late Q1 2016 for creating a
>> JIRA issue and adapting our connector code as needed.
>>
>> Our (quick) questions:
>> * Do we need to take any (non-coding) action for this beyond submitting
>>   a JIRA when we are ready?
>> * Are there any up-front concerns or questions which we can (or will
>>   need to) address?
>>
>> Thank you!
>>
>> James Malone
>> On behalf of the Google Big Data OSS Engineering Team
>>
>> Links:
>> [1] - https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
>> [2] - https://cloud.google.com/hadoop/google-cloud-storage-connector
>> [3] - https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
>> [4] - https://cloud.google.com/dataproc
>
