1. do what chris says: go for the abstract contract tests. They'll find the 
trouble spots in your code, like the way seek(-1) appears to have entertaining 
results, what happens on operations against closed files, etc., and they help 
identify where the semantics of your FS vary from HDFS.
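
To give a feel for it, hooking your FS up to the seek contract tests would look
something like the sketch below; the GoogleCloudStorageContract class, the "gs"
scheme and the contract/gs.xml resource are placeholders for whatever the GCS
module ends up defining, not existing code.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.contract.AbstractBondedFSContract;
  import org.apache.hadoop.fs.contract.AbstractContractSeekTest;
  import org.apache.hadoop.fs.contract.AbstractFSContract;

  public class TestGoogleCloudStorageContractSeek extends AbstractContractSeekTest {

    @Override
    protected AbstractFSContract createContract(Configuration conf) {
      return new GoogleCloudStorageContract(conf);
    }

    /** Hypothetical contract binding for a "gs" filesystem. */
    public static class GoogleCloudStorageContract extends AbstractBondedFSContract {
      public GoogleCloudStorageContract(Configuration conf) {
        super(conf);
        // contract/gs.xml declares which behaviours (atomic rename, seek past
        // EOF, etc.) the store claims to support; the bonded contract also
        // expects fs.contract.test.fs.gs to point at a live test bucket.
        addConfResource("contract/gs.xml");
      }

      @Override
      public String getScheme() {
        return "gs";
      }
    }
  }

There's one such subclass per operation group (open, seek, create, rename,
delete, ...), so the whole matrix of expected behaviours gets run against your
store.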

2. You will need to stay with the versions of artifacts in the Hadoop codebase. 
Trouble spots there are protobuf (frozen @ 2.5) and guava (we ship 11.0.2, but 
your code must still run against 18.x+ if someone upgrades it). If this is 
problematic you may want to discuss the versioning issues with your colleagues; 
see https://issues.apache.org/jira/browse/HADOOP-10101 for the details.
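
To show the kind of thing that bites people here (my own example, nothing from
your connector): guava 11.0.2 still has Closeables.closeQuietly(Closeable),
which later guava releases dropped, so code compiled against Hadoop's guava can
fail at runtime when a downstream app puts a newer guava first on the classpath.
A small local helper (or plain try-with-resources) is the safer bet:

  import java.io.Closeable;
  import java.io.IOException;

  public final class QuietClose {
    private QuietClose() {
    }

    /**
     * Close a stream, swallowing any IOException; prefer this (or
     * try-with-resources) over com.google.common.io.Closeables.closeQuietly,
     * which is not present in all the guava versions in play.
     */
    public static void closeQuietly(Closeable c) {
      if (c != null) {
        try {
          c.close();
        } catch (IOException ignored) {
          // deliberately ignored: only used on cleanup paths
        }
      }
    }
  }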

3. the object stores get undertested: jenkins doesn't touch them for patch 
review or nightly runs, because you can't give jenkins the right credentials. 
Setting up your own jenkins server to build the Hadoop versions and flag 
problems would be a great contribution here. Also: help with the release 
testing, and if someone has a patch for the hadoop-gcs module, reviewing and 
testing it would be great too; that stops these patches from being neglected.

4. We could do with some more scale tests of the object stores, to test 
creating many thousands of small files, etc. Contributions welcome.
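
Something along these lines is what's needed, with the bucket, file count and
test class name below all placeholders rather than anything that exists today:

  import static org.junit.Assert.assertEquals;

  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.junit.Test;

  public class TestManySmallFiles {

    @Test
    public void testCreateManySmallFiles() throws Exception {
      Configuration conf = new Configuration();   // needs your FS + credentials configured
      Path base = new Path("gs://your-test-bucket/scale/small-files");
      FileSystem fs = base.getFileSystem(conf);
      byte[] data = "tiny".getBytes(StandardCharsets.UTF_8);
      int files = 10000;                          // scale factor; make it configurable
      for (int i = 0; i < files; i++) {
        try (FSDataOutputStream out = fs.create(new Path(base, "file-" + i), true)) {
          out.write(data);
        }
      }
      // listing is usually where object stores hurt, so verify (and time) it too
      assertEquals(files, fs.listStatus(base).length);
    }
  }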

5. We could do with a lot more downstream testing of things like hive & spark 
IO on object stores, especially via ORC and Parquet. Helping to write those 
tests would stop regressions in the stack, and help tune Hadoop for your FS.
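
Even a trivial round-trip run against each store catches a lot; e.g. something
like this (Spark Dataset API, with the bucket a placeholder and the GCS
connector jar assumed to be on the classpath), plus the same thing for ORC:

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class ParquetOnObjectStoreSmokeTest {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder()
          .appName("parquet-on-gcs-smoke-test")
          .getOrCreate();
      String path = "gs://your-test-bucket/datasets/numbers";   // placeholder
      // write a small Parquet dataset, read it back, verify nothing was lost
      spark.range(0, 100000).write().mode("overwrite").parquet(path);
      Dataset<Row> back = spark.read().parquet(path);
      if (back.count() != 100000L) {
        throw new IllegalStateException("Parquet round trip lost rows");
      }
      spark.stop();
    }
  }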

6. Finally: don't be afraid to get involved with the rest of the codebase. It 
can only get better.


> On 8 Dec 2015, at 00:20, James Malone <jamesmal...@google.com.INVALID> wrote:
> 
> Haohui & Chris,
> 
> Sounds great, thank you very much! We'll cut a JIRA once we get everything
> lined up.
> 
> Best,
> 
> James
> 
> On Mon, Dec 7, 2015 at 3:54 PM, Chris Nauroth <cnaur...@hortonworks.com>
> wrote:
> 
>> Hi James,
>> 
>> This sounds great!  Thank you for considering contributing the code.
>> 
>> Just seconding what Haohui said, there is existing precedent for
>> alternative implementations of the Hadoop FileSystem in our codebase.  We
>> currently have similar plugins for S3 [1], Azure [2] and OpenStack Swift
>> [3].  Additionally, we have a suite of FileSystem contract tests [4].
>> These tests are designed to help developers of alternative file systems
>> assess how closely they match the semantics expected by Hadoop ecosystem
>> components.
>> 
>> Many Hadoop users are accustomed to using HDFS instead of these
>> alternative file systems, so none of the alternatives are on the default
>> Hadoop classpath immediately after deployment.  Instead, the code for each
>> one is in a separate module under the "hadoop-tools" directory in the
>> source tree.  Users who need to use the alternative file systems take
>> extra steps post-deployment to add them to the classpath where necessary.
>> This achieves the dependency isolation needed.  For example, users who
>> never use the Azure plugin won't accidentally pick up a transitive
>> dependency on the Azure SDK jar.
>> 
>> I recommend taking a quick glance through the existing modules for S3,
>> Azure and OpenStack.  We'll likely ask that a new FileSystem
>> implementation follow the same patterns if feasible for consistency.  This
>> would include things like using the contract tests, having a provision to
>> execute tests both offline/mocked and live/integrated with the real
>> service and providing a documentation page that explains configuration for
>> end users.
>> 
>> For now, please feel free to file a HADOOP JIRA with your proposal.  We
>> can work out the details of all of this in discussion on that JIRA.
>> 
>> Something else to follow up on will be licensing concerns.  I see the
>> project already uses the Apache license, but it appears to be an existing
>> body of code initially developed at Google.  That might require a Software
>> Grant Agreement [5].  Again, this is something that can be hashed out in
>> discussion on the JIRA after you create it.
>> 
>> [1]
>> http://hadoop.apache.org/docs/r2.7.1/hadoop-aws/tools/hadoop-aws/index.html
>> [2] http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html
>> [3] http://hadoop.apache.org/docs/r2.7.1/hadoop-openstack/index.html
>> [4]
>> http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/filesystem/testing.html
>> [5] http://www.apache.org/licenses/
>> 
>> --Chris Nauroth
>> 
>> 
>> 
>> 
>> On 12/7/15, 3:10 PM, "Haohui Mai" <ricet...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> Thanks for reaching out. It would be great to see this in the Hadoop
>>> ecosystem.
>>> 
>>> In Hadoop we have AWS S3 support. IMO they address similar use cases
>>> thus I think that it should be relatively straightforward to adopt the
>>> code.
>>> 
>>> The only catch in my head right now is properly isolating dependencies.
>>> Not only does the code need to go into a separate module, but many
>>> Hadoop applications also depend on different versions of Guava. I
>>> think it might be a problem that needs some attention at the very
>>> beginning.
>>> 
>>> Please feel free to reach out if you have any other questions.
>>> 
>>> Regards,
>>> Haohui
>>> 
>>> 
>>> On Mon, Dec 7, 2015 at 2:35 PM, James Malone
>>> <jamesmal...@google.com.invalid> wrote:
>>>> Hello,
>>>> 
>>>> We're from a team within Google Cloud Platform focused on OSS and data
>>>> technologies, especially Hadoop (and Spark). Before we cut a JIRA for
>>>> something we'd like to do, we wanted to reach out to this list to ask two
>>>> quick questions, describe our proposed action, and check for any major
>>>> objections.
>>>> 
>>>> Proposed action:
>>>> We have a Hadoop connector[1] (more info[2]) for Google Cloud Storage (GCS)
>>>> which we have been building and maintaining for some time. After we clean
>>>> up our code and tests to conform (to these[3] and other requirements) we
>>>> would like to contribute it to Hadoop. We have many customers using the
>>>> connector in high-throughput production Hadoop clusters; we'd like to make
>>>> it easier and faster to use Hadoop and GCS.
>>>> 
>>>> Timeline:
>>>> Presently, we are working on the beta of Google Cloud Dataproc[4] which
>>>> limits our time a bit, so we're targeting late Q1 2016 for creating a JIRA
>>>> issue and adapting our connector code as needed.
>>>> 
>>>> Our (quick) questions:
>>>> * Do we need to take any (non-coding) action for this beyond submitting a
>>>> JIRA when we are ready?
>>>> * Are there any up-front concerns or questions which we can (or will need
>>>> to) address?
>>>> 
>>>> Thank you!
>>>> 
>>>> James Malone
>>>> On behalf of the Google Big Data OSS Engineering Team
>>>> 
>>>> Links:
>>>> [1] -
>>>> https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
>>>> [2] - https://cloud.google.com/hadoop/google-cloud-storage-connector
>>>> [3] -
>>>> https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
>>>> [4] - https://cloud.google.com/dataproc
>>> 
>> 
>> 
