From what I saw in the changes in Chris Milbert's fork, it contains a couple of S3 implementations of Hadoop's FileSystem interface in a separate module (similar to the s3a:// and abfss:// implementations). It seems to add accS3mo:// and accS3nf:// implementations, which, in spite of their names, do not appear to be Accumulo-specific (that's a good thing, as these could be reused by other projects as well!).

In addition, these FileSystem implementations seem to be accompanied by a few changes to Accumulo code itself, but I couldn't tell if these were necessary to improve compatibility with the new FileSystems or if they were unrelated additional enhancements to Accumulo. They also appeared to be based on an older 2.0 branch, rather than the latest 2.1 / main branch, and they conflict with some of the changes in the 2.1 branch, so those changes will need to be rebased.

So, I suggest isolating the FileSystem implementations from the changes to Accumulo. The FileSystem implementations don't need to be merged into Accumulo's code base, or built as part of Accumulo at all. They are completely independent from Accumulo and can exist in their own repo, for use by any other user, just like s3a:// or abfss://. The Accumulo PMC could decide to accept responsibility for these FileSystem implementations, but I don't think the Accumulo project at the ASF is the best home for them, as they are not Accumulo-specific. They might make more sense as a subproject of Hadoop instead of Accumulo, since they are Hadoop FileSystem implementations, or they could remain a 3rd party repository on GitHub as part of the larger Hadoop ecosystem. Finding the best home for these may take some additional research on the part of their developers.

The changes to Accumulo itself, separate from the S3 FileSystem implementations, will be easiest to incorporate into the 2.1 / main branch if they are rebased first, and submitted from a fork on GitHub (Chris Milbert's repo does not appear to be a "fork", but a disconnected clone, so creating a PR using GitHub's UI won't be possible without first recreating the repo using the "fork" feature on GitHub). If there are multiple, discrete changes, serving independent purposes, the changes should be teased apart and submitted as separate PRs against the main branch, so they can be evaluated on their own merits through the code review process.
It is hard to consider their merits without a pull request for those changes.

I think the discussion of abstracting the storage layer in Accumulo is a worthy one, but I think it can be set aside for now. Abstracting the storage layer from Hadoop would involve creating Accumulo-specific storage APIs, and corralling Hadoop FileSystem API calls behind an implementation of that Accumulo storage API. However, that's not necessary for this. We currently use Hadoop's FileSystem APIs throughout our own code, and Hadoop's FileSystem already provides sufficient abstraction for the purposes of adding S3 support to Accumulo, and that's what appears to have been done by Chris Milbert. So, there's no need to complicate the discussion with additional potential future work to further abstract Hadoop FileSystem API calls. That abstraction doesn't appear to be a necessary prerequisite to considering the work done by Chris in his repo.

To me, the main questions are:

1. Can the new FileSystem implementations be used as easily as other drop-in implementations, like s3a:// and abfss://?
2. Where is the best home for these FileSystem implementations?
3. What benefits do the other changes to Accumulo serve, and can they be rebased and submitted as separate PRs against Accumulo's main branch?

On Tue, Jul 27, 2021 at 2:00 PM Arvind Shyamsundar <arvin...@microsoft.com.invalid> wrote:
>
> Hi Jeff, what would be the difference between this path, and what can be
> accomplished by using a Hadoop FileSystem interface based connector to talk
> to S3? Is it because of the consistency limitations with s3a://
> (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)?
>
> As you probably know for Azure, we went with the abfss:// connector provided
> as part of hadoop-azure
> (https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html) with minimal
> effort. Just wondering what the key difference here is for S3.
>
> Thanks!
>
> Arvind.
>
> -----Original Message-----
> From: Jeff Kubina <jeff.kub...@gmail.com>
> Sent: Tuesday, July 27, 2021 10:16 AM
> To: dev@accumulo.apache.org
> Subject: [EXTERNAL] Accumulo with Native S3 Support
>
> All,
>
> Some of AWS's back end services use a version of Accumulo modified to use
> Amazon's S3 as its storage system. Amazon engineers forked Accumulo 2.0 and
> merged that S3 support into it
> <https://github.com/cmilbert/accumulo/>.
> Chris Milbert is the lead Amazon engineer who did the integration. Chris and
> I would like to jump start the conversation about how best to initiate the
> pull request for these changes into Accumulo 2.1.
>
> Mike Wall suggested using this as an opportunity to abstract out the storage
> system of Accumulo and make it pluggable. He suggested the following broad
> steps:
>
> 1. Identify all the things HDFS provides such as read, write,
>    replication and failover.
> 2. Abstract out a file system interface with hooks for all those things
>    (and does not require loading hadoop jars).
> 3. Plugin HDFS as the default implementation of that interface, hiding
>    all hadoop jars there.
> 4. Make another implementation that plugins in S3 and make it optionally
>    configured.
> 5. Run tests to make sure we didn't break things with HDFS.
> 6. Run tests to see if S3 meets all the requirements.
>
> Ed Coleman also suggested first forking Accumulo 2.1 and merging the S3
> changes into it.
>
> Chris and I look forward to the discussion on how best to add S3 support to
> Accumulo.
>
> Thanks,
> Jeff
> --
> Jeff Kubina
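
For anyone who wants to picture the storage abstraction discussed in this thread (an Accumulo-specific storage API with a backend chosen by configuration), here is a rough Java sketch. Everything in it is an assumption made for illustration: the names Storage, InMemoryStorage, and StorageRegistry are hypothetical and do not exist in Accumulo, Hadoop, or Chris's repo, the in-memory map only stands in for a real HDFS or S3 backend, and the file path is made up. It is not a proposal for the actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names exist in Accumulo or Hadoop.
class StorageSketch {

  /** Hypothetical storage API that calling code would use instead of Hadoop's FileSystem. */
  interface Storage {
    void write(String path, byte[] data);

    byte[] read(String path);

    boolean exists(String path);
  }

  /** Stand-in backend; a real one would delegate to HDFS or S3 client calls. */
  static final class InMemoryStorage implements Storage {
    private final Map<String, byte[]> files = new ConcurrentHashMap<>();

    @Override
    public void write(String path, byte[] data) {
      files.put(path, data.clone()); // copy so later changes by the caller are not visible
    }

    @Override
    public byte[] read(String path) {
      byte[] data = files.get(path);
      if (data == null) {
        throw new NoSuchElementException("No such file: " + path);
      }
      return data.clone();
    }

    @Override
    public boolean exists(String path) {
      return files.containsKey(path);
    }
  }

  /** Maps a configured backend name to a factory, so the backend is pluggable. */
  static final class StorageRegistry {
    private final Map<String, Supplier<Storage>> factories = new HashMap<>();

    void register(String name, Supplier<Storage> factory) {
      factories.put(name, factory);
    }

    Storage create(String name) {
      Supplier<Storage> factory = factories.get(name);
      if (factory == null) {
        throw new IllegalArgumentException("Unknown storage backend: " + name);
      }
      return factory.get();
    }
  }

  public static void main(String[] args) {
    StorageRegistry registry = new StorageRegistry();
    // Both names map to the in-memory stand-in here; real backends would differ.
    registry.register("hdfs", InMemoryStorage::new);
    registry.register("s3", InMemoryStorage::new);

    Storage storage = registry.create("s3"); // the name would come from configuration
    String path = "/tables/1/example.rf"; // illustrative path only
    storage.write(path, "hello".getBytes(StandardCharsets.UTF_8));
    System.out.println(storage.exists(path));
    System.out.println(new String(storage.read(path), StandardCharsets.UTF_8));
  }
}
```

The point of the sketch is only that callers depend on the small interface and never on a concrete backend. That is the same role Hadoop's FileSystem already plays today, which is why this extra layer is not a prerequisite for the S3 work.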