See also the HCFS wiki page, https://wiki.apache.org/hadoop/HCFS/Progress, which attempts to explain this for the community. It may need some updates as well; I haven't looked in a while, as I've moved on to working on other products these days.
On Tue, Dec 8, 2015 at 12:50 PM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> 1. Do what Chris says: go for the abstract contract tests. They'll find
> the trouble spots in your code, like the way seek(-1) appears to have
> entertaining results, what happens on operations on closed files, etc., and
> help identify where the semantics of your FS vary from HDFS.
>
> 2. You will need to stay with the versions of artifacts in the Hadoop
> codebase. Trouble spots there are protobuf (frozen at 2.5) and Guava
> (shipping with 11.0.2; code must run against 18.x+ if someone upgrades). If
> this is problematic, you may want to discuss the versioning issues with
> your colleagues; see https://issues.apache.org/jira/browse/HADOOP-10101
> for the details.
>
> 3. The object stores get undertested: Jenkins doesn't touch them for patch
> review or nightly runs, since you can't give Jenkins the right credentials.
> Setting up your own Jenkins server to build the Hadoop versions and flag
> problems would be a great contribution here. Also: help with the release
> testing; if someone has a patch for the hadoop-gcs module, reviewing and
> testing that too would be great, and stops these patches being neglected.
>
> 4. We could do with some more scale tests of the object stores, to test
> creating many thousands of small files, etc. Contributions welcome.
>
> 5. We could do with a lot more downstream testing of things like Hive and
> Spark IO on object stores, especially via ORC and Parquet. Helping to write
> those tests would stop regressions in the stack and help tune Hadoop for
> your FS.
>
> 6. Finally: don't be afraid to get involved with the rest of the codebase.
> It can only get better.
>
> > On 8 Dec 2015, at 00:20, James Malone <jamesmal...@google.com.INVALID> wrote:
> >
> > Haohui & Chris,
> >
> > Sounds great, thank you very much! We'll cut a JIRA once we get
> > everything lined up.
> >
> > Best,
> >
> > James
> >
> > On Mon, Dec 7, 2015 at 3:54 PM, Chris Nauroth <cnaur...@hortonworks.com>
> > wrote:
> >
> >> Hi James,
> >>
> >> This sounds great! Thank you for considering contributing the code.
> >>
> >> Just seconding what Haohui said: there is existing precedent for
> >> alternative implementations of the Hadoop FileSystem in our codebase. We
> >> currently have similar plugins for S3 [1], Azure [2], and OpenStack Swift
> >> [3]. Additionally, we have a suite of FileSystem contract tests [4].
> >> These tests are designed to help developers of alternative file systems
> >> assess how closely they match the semantics expected by Hadoop ecosystem
> >> components.
> >>
> >> Many Hadoop users are accustomed to using HDFS instead of these
> >> alternative file systems, so none of the alternatives are on the default
> >> Hadoop classpath immediately after deployment. Instead, the code for each
> >> one is in a separate module under the "hadoop-tools" directory in the
> >> source tree. Users who need the alternative file systems take
> >> extra steps post-deployment to add them to the classpath where necessary.
> >> This achieves the dependency isolation needed. For example, users who
> >> never use the Azure plugin won't accidentally pick up a transitive
> >> dependency on the Azure SDK jar.
> >>
> >> I recommend taking a quick glance through the existing modules for S3,
> >> Azure, and OpenStack. We'll likely ask that a new FileSystem
> >> implementation follow the same patterns, where feasible, for consistency.
> >> This would include things like using the contract tests, having a
> >> provision to execute tests both offline/mocked and live/integrated with
> >> the real service, and providing a documentation page that explains
> >> configuration for end users.
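A minimal sketch of how a new connector typically wires into those contract tests. AbstractContractSeekTest, AbstractFSContract, and AbstractBondedFSContract are existing classes in the hadoop-common test artifact; GCSContract, the "gs" scheme binding, and the contract/gcs.xml resource are illustrative names for a connector that does not exist in-tree yet:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.contract.AbstractBondedFSContract;
    import org.apache.hadoop.fs.contract.AbstractContractSeekTest;
    import org.apache.hadoop.fs.contract.AbstractFSContract;

    // Declares which semantics the filesystem claims to support; the claims
    // live in an XML resource (here, an assumed contract/gcs.xml).
    class GCSContract extends AbstractBondedFSContract {
      public GCSContract(Configuration conf) {
        super(conf);
        addConfResource("contract/gcs.xml");
      }

      @Override
      public String getScheme() {
        return "gs";
      }
    }

    // One small subclass per contract suite: seek, open, create, delete,
    // rename, etc. The base class supplies the test cases (including edge
    // cases such as seek(-1) and reads on closed streams); the subclass
    // only binds the contract.
    public class TestGCSContractSeek extends AbstractContractSeekTest {
      @Override
      protected AbstractFSContract createContract(Configuration conf) {
        return new GCSContract(conf);
      }
    }

The existing hadoop-aws, hadoop-azure, and hadoop-openstack modules follow this same pattern, one subclass per contract aspect.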
> >> For now, please feel free to file a HADOOP JIRA with your proposal. We
> >> can work out the details of all of this in discussion on that JIRA.
> >>
> >> Something else to follow up on will be licensing concerns. I see the
> >> project already uses the Apache license, but it appears to be an existing
> >> body of code initially developed at Google. That might require a Software
> >> Grant Agreement [5]. Again, this is something that can be hashed out in
> >> discussion on the JIRA after you create it.
> >>
> >> [1] http://hadoop.apache.org/docs/r2.7.1/hadoop-aws/tools/hadoop-aws/index.html
> >> [2] http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html
> >> [3] http://hadoop.apache.org/docs/r2.7.1/hadoop-openstack/index.html
> >> [4] http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/filesystem/testing.html
> >> [5] http://www.apache.org/licenses/
> >>
> >> --Chris Nauroth
> >>
> >> On 12/7/15, 3:10 PM, "Haohui Mai" <ricet...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks for reaching out. It would be great to see this in the Hadoop
> >>> ecosystem.
> >>>
> >>> In Hadoop we have AWS S3 support. IMO they address similar use cases,
> >>> so I think it should be relatively straightforward to adopt the code.
> >>>
> >>> The only catch in my head right now is properly isolating dependencies.
> >>> Not only does the code need to be put into a separate module, but many
> >>> Hadoop applications also depend on different versions of Guava. I
> >>> think this is a problem that needs some attention at the very
> >>> beginning.
> >>>
> >>> Please feel free to reach out if you have any other questions.
> >>>
> >>> Regards,
> >>> Haohui
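On the Guava point: one common way for a connector module to isolate its own Guava is to shade and relocate it at build time, so it cannot clash with whatever version Hadoop or the application puts on the classpath. A sketch of the relevant maven-shade-plugin fragment; the relocated package prefix is illustrative, not any connector's actual one:

    <!-- Bundle Guava into the connector jar under a private package name. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <relocation>
                <!-- Rewrites the connector's own references to Guava so they
                     resolve to the relocated copy inside the jar. -->
                <pattern>com.google.common</pattern>
                <shadedPattern>com.example.repackaged.com.google.common</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>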
> >>> On Mon, Dec 7, 2015 at 2:35 PM, James Malone
> >>> <jamesmal...@google.com.invalid> wrote:
> >>>> Hello,
> >>>>
> >>>> We're from a team within Google Cloud Platform focused on OSS and data
> >>>> technologies, especially Hadoop (and Spark). Before we cut a JIRA for
> >>>> something we'd like to do, we wanted to reach out to this list to ask
> >>>> two quick questions, describe our proposed action, and check for any
> >>>> major objections.
> >>>>
> >>>> Proposed action:
> >>>> We have a Hadoop connector [1] (more info [2]) for Google Cloud Storage
> >>>> (GCS) which we have been building and maintaining for some time. After
> >>>> we clean up our code and tests to conform (to these [3] and other
> >>>> requirements), we would like to contribute it to Hadoop. We have many
> >>>> customers using the connector in high-throughput production Hadoop
> >>>> clusters; we'd like to make it easier and faster to use Hadoop and GCS.
> >>>>
> >>>> Timeline:
> >>>> Presently, we are working on the beta of Google Cloud Dataproc [4],
> >>>> which limits our time a bit, so we're targeting late Q1 2016 for
> >>>> creating a JIRA issue and adapting our connector code as needed.
> >>>>
> >>>> Our (quick) questions:
> >>>> * Do we need to take any (non-coding) action for this beyond
> >>>> submitting a JIRA when we are ready?
> >>>> * Are there any up-front concerns or questions which we can (or will
> >>>> need to) address?
> >>>>
> >>>> Thank you!
> >>>>
> >>>> James Malone
> >>>> On behalf of the Google Big Data OSS Engineering Team
> >>>>
> >>>> Links:
> >>>> [1] https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
> >>>> [2] https://cloud.google.com/hadoop/google-cloud-storage-connector
> >>>> [3] https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
> >>>> [4] https://cloud.google.com/dataproc

--
jay vyas
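As a footnote on what the "extra steps post-deployment" look like for an out-of-tree connector such as this one: the user drops the connector jar on the Hadoop classpath and registers the scheme in core-site.xml. A sketch using the GCS connector's documented keys; exact property and class names follow the connector's own documentation and may vary by version:

    <!-- core-site.xml: bind the gs:// scheme to the connector classes. -->
    <configuration>
      <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
      </property>
      <property>
        <name>fs.AbstractFileSystem.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
      </property>
      <property>
        <!-- Placeholder value; set to the Google Cloud project owning the
             buckets. -->
        <name>fs.gs.project.id</name>
        <value>your-project-id</value>
      </property>
    </configuration>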