See also the HCFS wiki page, https://wiki.apache.org/hadoop/HCFS/Progress, which attempts to explain this for the community. It may need some updates as well; I haven't looked in a while, as I've moved on to working on other products these days.
On Tue, Dec 8, 2015 at 12:50 PM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> 1. Do what Chris says: go for the abstract contract tests. They'll find
> the trouble spots in your code, like the way seek(-1) appears to have
> entertaining results, what happens on operations on closed files, etc., and
> help identify where the semantics of your FS vary from HDFS.
>
> 2. You will need to stay with the versions of artifacts in the Hadoop
> codebase. Trouble spots there are protobuf (frozen at 2.5) and Guava
> (shipping with 11.0.2; code must run against 18.x+ if someone upgrades). If
> this is problematic, you may want to discuss the versioning issues with
> your colleagues; see https://issues.apache.org/jira/browse/HADOOP-10101
> for the details.
>
> 3. The object stores get undertested: Jenkins doesn't touch them for patch
> review or nightly runs, since you can't give Jenkins the right credentials.
> Setting up your own Jenkins server to build the Hadoop versions and flag
> problems would be a great contribution here. Also: help with the release
> testing; if someone has a patch for the hadoop-gcs module, reviewing and
> testing that too would be great, and stops these patches being neglected.
>
> 4. We could do with some more scale tests of the object stores, to test
> creating many thousands of small files, etc. Contributions welcome.
>
> 5. We could do with a lot more downstream testing of things like Hive and
> Spark IO on object stores, especially via ORC and Parquet. Helping to write
> those tests would stop regressions in the stack and help tune Hadoop for
> your FS.
>
> 6. Finally: don't be afraid to get involved with the rest of the codebase.
> It can only get better.
>
> > On 8 Dec 2015, at 00:20, James Malone <jamesmal...@google.com.INVALID> wrote:
> >
> > Haohui & Chris,
> >
> > Sounds great, thank you very much! We'll cut a JIRA once we get
> > everything lined up.
> >
> > Best,
> >
> > James
> >
> > On Mon, Dec 7, 2015 at 3:54 PM, Chris Nauroth <cnaur...@hortonworks.com>
> > wrote:
> >
> >> Hi James,
> >>
> >> This sounds great! Thank you for considering contributing the code.
> >>
> >> Just seconding what Haohui said: there is existing precedent for
> >> alternative implementations of the Hadoop FileSystem in our codebase. We
> >> currently have similar plugins for S3 [1], Azure [2], and OpenStack Swift
> >> [3]. Additionally, we have a suite of FileSystem contract tests [4].
> >> These tests are designed to help developers of alternative file systems
> >> assess how closely they match the semantics expected by Hadoop ecosystem
> >> components.
> >>
> >> Many Hadoop users are accustomed to using HDFS instead of these
> >> alternative file systems, so none of the alternatives are on the default
> >> Hadoop classpath immediately after deployment. Instead, the code for each
> >> one is in a separate module under the "hadoop-tools" directory in the
> >> source tree. Users who need the alternative file systems take
> >> extra steps post-deployment to add them to the classpath where necessary.
> >> This achieves the dependency isolation needed. For example, users who
> >> never use the Azure plugin won't accidentally pick up a transitive
> >> dependency on the Azure SDK jar.
> >>
> >> I recommend taking a quick glance through the existing modules for S3,
> >> Azure, and OpenStack. We'll likely ask that a new FileSystem
> >> implementation follow the same patterns, where feasible, for consistency.
> >> This would include things like using the contract tests, having a
> >> provision to execute tests both offline/mocked and live/integrated with
> >> the real service, and providing a documentation page that explains
> >> configuration for end users.
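A minimal sketch of how a new connector typically wires into those contract tests. AbstractContractSeekTest, AbstractFSContract, and AbstractBondedFSContract are existing classes in the hadoop-common test artifact; GCSContract, the "gs" scheme binding, and the contract/gcs.xml resource are illustrative names for a connector that does not exist in-tree yet:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.contract.AbstractBondedFSContract;
    import org.apache.hadoop.fs.contract.AbstractContractSeekTest;
    import org.apache.hadoop.fs.contract.AbstractFSContract;

    // Declares which semantics the filesystem claims to support; the claims
    // live in an XML resource (here, an assumed contract/gcs.xml).
    class GCSContract extends AbstractBondedFSContract {
      public GCSContract(Configuration conf) {
        super(conf);
        addConfResource("contract/gcs.xml");
      }

      @Override
      public String getScheme() {
        return "gs";
      }
    }

    // One small subclass per contract suite: seek, open, create, delete,
    // rename, etc. The base class supplies the test cases (including edge
    // cases such as seek(-1) and reads on closed streams); the subclass
    // only binds the contract.
    public class TestGCSContractSeek extends AbstractContractSeekTest {
      @Override
      protected AbstractFSContract createContract(Configuration conf) {
        return new GCSContract(conf);
      }
    }

The existing hadoop-aws, hadoop-azure, and hadoop-openstack modules follow this same pattern, one subclass per contract aspect.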
> >> For now, please feel free to file a HADOOP JIRA with your proposal. We
> >> can work out the details of all of this in discussion on that JIRA.
> >>
> >> Something else to follow up on will be licensing concerns. I see the
> >> project already uses the Apache license, but it appears to be an existing
> >> body of code initially developed at Google. That might require a Software
> >> Grant Agreement [5]. Again, this is something that can be hashed out in
> >> discussion on the JIRA after you create it.
> >>
> >> [1] http://hadoop.apache.org/docs/r2.7.1/hadoop-aws/tools/hadoop-aws/index.html
> >> [2] http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html
> >> [3] http://hadoop.apache.org/docs/r2.7.1/hadoop-openstack/index.html
> >> [4] http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/filesystem/testing.html
> >> [5] http://www.apache.org/licenses/
> >>
> >> --Chris Nauroth
> >>
> >> On 12/7/15, 3:10 PM, "Haohui Mai" <ricet...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks for reaching out. It would be great to see this in the Hadoop
> >>> ecosystem.
> >>>
> >>> In Hadoop we have AWS S3 support. IMO they address similar use cases,
> >>> so I think it should be relatively straightforward to adopt the code.
> >>>
> >>> The only catch in my head right now is properly isolating dependencies.
> >>> Not only does the code need to be put into a separate module, but many
> >>> Hadoop applications also depend on different versions of Guava. I
> >>> think this is a problem that needs some attention at the very
> >>> beginning.
> >>>
> >>> Please feel free to reach out if you have any other questions.
> >>>
> >>> Regards,
> >>> Haohui
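On the Guava point: one common way for a connector module to isolate its own Guava is to shade and relocate it at build time, so it cannot clash with whatever version Hadoop or the application puts on the classpath. A sketch of the relevant maven-shade-plugin fragment; the relocated package prefix is illustrative, not any connector's actual one:

    <!-- Bundle Guava into the connector jar under a private package name. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <relocation>
                <!-- Rewrites the connector's own references to Guava so they
                     resolve to the relocated copy inside the jar. -->
                <pattern>com.google.common</pattern>
                <shadedPattern>com.example.repackaged.com.google.common</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>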
> >>> On Mon, Dec 7, 2015 at 2:35 PM, James Malone
> >>> <jamesmal...@google.com.invalid> wrote:
> >>>> Hello,
> >>>>
> >>>> We're from a team within Google Cloud Platform focused on OSS and data
> >>>> technologies, especially Hadoop (and Spark). Before we cut a JIRA for
> >>>> something we'd like to do, we wanted to reach out to this list to ask
> >>>> two quick questions, describe our proposed action, and check for any
> >>>> major objections.
> >>>>
> >>>> Proposed action:
> >>>> We have a Hadoop connector [1] (more info [2]) for Google Cloud Storage
> >>>> (GCS) which we have been building and maintaining for some time. After
> >>>> we clean up our code and tests to conform (to these [3] and other
> >>>> requirements), we would like to contribute it to Hadoop. We have many
> >>>> customers using the connector in high-throughput production Hadoop
> >>>> clusters; we'd like to make it easier and faster to use Hadoop and GCS.
> >>>>
> >>>> Timeline:
> >>>> Presently, we are working on the beta of Google Cloud Dataproc [4],
> >>>> which limits our time a bit, so we're targeting late Q1 2016 for
> >>>> creating a JIRA issue and adapting our connector code as needed.
> >>>>
> >>>> Our (quick) questions:
> >>>> * Do we need to take any (non-coding) action for this beyond
> >>>> submitting a JIRA when we are ready?
> >>>> * Are there any up-front concerns or questions which we can (or will
> >>>> need to) address?
> >>>>
> >>>> Thank you!
> >>>>
> >>>> James Malone
> >>>> On behalf of the Google Big Data OSS Engineering Team
> >>>>
> >>>> Links:
> >>>> [1] https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
> >>>> [2] https://cloud.google.com/hadoop/google-cloud-storage-connector
> >>>> [3] https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
> >>>> [4] https://cloud.google.com/dataproc

--
jay vyas
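As a footnote on what the "extra steps post-deployment" look like for an out-of-tree connector such as this one: the user drops the connector jar on the Hadoop classpath and registers the scheme in core-site.xml. A sketch using the GCS connector's documented keys; exact property and class names follow the connector's own documentation and may vary by version:

    <!-- core-site.xml: bind the gs:// scheme to the connector classes. -->
    <configuration>
      <property>
        <name>fs.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
      </property>
      <property>
        <name>fs.AbstractFileSystem.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
      </property>
      <property>
        <!-- Placeholder value; set to the Google Cloud project owning the
             buckets. -->
        <name>fs.gs.project.id</name>
        <value>your-project-id</value>
      </property>
    </configuration>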