Re: [EXTERNAL] Accumulo with Native S3 Support

Christopher Wed, 28 Jul 2021 10:41:33 -0700

From what I saw while looking at the changes in Chris Milbert's fork,
the fork contains a couple of S3 implementations of Hadoop's FileSystem
interface in a separate module (similar to s3a:// and abfss://
implementations). It seems to add accS3mo:// and accS3nf://
implementations, which, in spite of their names, do not appear to be
Accumulo-specific (that's a good thing... as these could be reused by
other projects as well!).

In addition, these FileSystem implementations seem to be accompanied
by a few changes to Accumulo code itself, but I couldn't tell if these
were necessary to improve compatibility with these new FileSystems or
if they were unrelated additional enhancements to Accumulo. They also
appeared to be based on an older 2.0 branch, rather than the latest
2.1 / main branch, and they conflict with some of the changes in the
2.1 branch. So those changes will need to be rebased.

So, I suggest isolating the FileSystem implementations from the
changes to Accumulo. The FileSystem implementations don't need to be
merged into Accumulo's code base, or built as part of Accumulo at all.
They are completely independent from Accumulo and can exist in their
own repo, for use by any other user, just like s3a:// or abfss:// .
The Accumulo PMC could decide to accept responsibility for these
FileSystem implementations, but I don't think the Accumulo project at
the ASF is the best home for them, as they are not Accumulo-specific.
It might make more sense as a subproject of Hadoop instead of
Accumulo, since they are Hadoop FileSystem implementations, or remain
as a 3rd party repository on GitHub as part of the larger Hadoop
ecosystem. Finding the best home for these may take some additional
research on the part of its developers.
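
As a rough illustration of why these can live outside Accumulo: Hadoop
picks a FileSystem implementation by URI scheme, for example through an
fs.<scheme>.impl configuration property. The sketch below models that
lookup with the Java standard library only. It does not use Hadoop's
real classes, SchemeLookupSketch is a made-up name, and the class name
given for accS3nf:// is a placeholder, not the one used in Chris's repo.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class SchemeLookupSketch {
    // Hypothetical stand-in for a Hadoop-style configuration:
    // keys look like "fs.<scheme>.impl", values are class names.
    private final Map<String, String> conf = new HashMap<>();

    public void set(String key, String value) {
        conf.put(key, value);
    }

    // Resolve the implementation class name for a URI from its scheme.
    public String implFor(URI uri) {
        String scheme = uri.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + uri);
        }
        String impl = conf.get("fs." + scheme + ".impl");
        if (impl == null) {
            throw new IllegalStateException("No FileSystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        SchemeLookupSketch conf = new SchemeLookupSketch();
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        // Placeholder class name, assumed for illustration only.
        conf.set("fs.accS3nf.impl", "example.AccS3nfFileSystem");
        System.out.println(conf.implFor(URI.create("accS3nf://bucket/accumulo")));
    }
}
```

The caller only ever sees a URI, so adding a new scheme is a matter of
putting a jar on the classpath and adding configuration, with no change
to the calling code.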

The changes to Accumulo itself, separate from the S3 FileSystem
implementations, will be easiest to incorporate into the 2.1 / main
branch if they are rebased first, and submitted from a fork on GitHub
(Chris Milbert's repo does not appear to be a "fork", but a
disconnected clone, so creating a PR using GitHub's UI won't be
possible without first recreating the repo using the "fork" feature on
GitHub). If there are multiple, discrete changes, serving independent
purposes, the changes should be teased apart and submitted as separate
PRs against the main branch, so they can be evaluated on their own
merits through the code review process. It is hard to consider their
merits without a pull request for those changes.

I think the discussion of abstracting the storage layer in Accumulo is
a worthy one, but I think it can be set aside for now. Abstracting the
storage layer from Hadoop would involve creating Accumulo-specific
storage APIs, and corralling Hadoop FileSystem API calls behind an
implementation of that Accumulo storage API. However, that's not
necessary for this. We currently use Hadoop's FileSystem APIs
throughout our own code, and Hadoop's FileSystem already provides
sufficient abstraction for the purposes of adding S3 support to
Accumulo, and that's what appears to have been done by Chris Milbert.
So, there's no need to complicate the discussion with additional
potential future work to further abstract Hadoop FileSystem API calls.
That abstraction doesn't appear to be a necessary prerequisite to
considering the work done by Chris in his repo.
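
For anyone wondering what that further abstraction might look like, here
is a minimal sketch, not a proposal. The names TabletStorage and
InMemoryStorage are hypothetical, and the in-memory class only stands in
for an implementation that would wrap Hadoop FileSystem calls. The point
is that no Hadoop types appear in the Accumulo-owned interface.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;

public class StorageAbstractionSketch {

    /** Hypothetical Accumulo-owned storage API; no Hadoop types leak out. */
    public interface TabletStorage {
        void write(String path, byte[] data);
        byte[] read(String path);
        boolean exists(String path);
    }

    /**
     * In-memory stand-in. A real implementation would delegate each
     * method to the underlying file system (HDFS, S3, and so on).
     */
    public static class InMemoryStorage implements TabletStorage {
        private final Map<String, byte[]> files = new ConcurrentHashMap<>();

        @Override
        public void write(String path, byte[] data) {
            files.put(path, data.clone()); // copy so callers cannot mutate stored bytes
        }

        @Override
        public byte[] read(String path) {
            byte[] data = files.get(path);
            if (data == null) {
                throw new NoSuchElementException("No such file: " + path);
            }
            return data.clone();
        }

        @Override
        public boolean exists(String path) {
            return files.containsKey(path);
        }
    }

    public static void main(String[] args) {
        TabletStorage storage = new InMemoryStorage();
        storage.write("/tables/1/example.rf", new byte[] {1, 2, 3});
        System.out.println(storage.exists("/tables/1/example.rf"));
    }
}
```

Every caller in Accumulo would have to be moved onto such an interface,
which is why this is a much larger change than adding a new FileSystem.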

To me, the main questions are:

1. Can the new FileSystem implementations be used as easily as other
drop-in implementations, like s3a:// and abfss:// ?
2. Where is the best home for these FileSystem implementations?
3. What benefits do the other changes to Accumulo serve, and can they
be rebased and submitted as separate PRs against Accumulo's main
branch?

On Tue, Jul 27, 2021 at 2:00 PM Arvind Shyamsundar
<[email protected]> wrote:
>
> Hi Jeff, what would be the difference between this path, and what can be 
> accomplished by using a Hadoop FileSystem interface based connector to talk 
> to S3? Is it because of the consistency limitations with s3a:// 
> (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)?
>
> As you probably know for Azure, we went with the abfss:// connector provided 
> as part of hadoop-azure 
> (https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html) with minimal 
> effort. Just wondering what the key difference here is for S3.
>
> Thanks!
>
> Arvind.
>
> -----Original Message-----
> From: Jeff Kubina <[email protected]>
> Sent: Tuesday, July 27, 2021 10:16 AM
> To: [email protected]
> Subject: [EXTERNAL] Accumulo with Native S3 Support
>
> All,
>
> Some of AWS's back end services use a version of Accumulo modified to use 
> Amazon's S3 as its storage system. Amazon engineers forked Accumulo 2.0 and 
> merged that S3 support into it 
> <https://github.com/cmilbert/accumulo/>.
> Chris Milbert is the lead Amazon engineer who did the integration. Chris and 
> I would like to jump start the conversation about how best to initiate the 
> pull request for these changes into Accumulo 2.1.
>
> Mike Wall suggested using this as an opportunity to abstract out the storage 
> system of Accumulo and make it pluggable. He suggested the following broad 
> steps:
>
>    1. Identify all the things HDFS provides such as read, write,
>    replication and failover.
>    2. Abstract out a file system interface with hooks for all those things
>    (one that does not require loading Hadoop jars).
>    3. Plug in HDFS as the default implementation of that interface, hiding
>    all Hadoop jars there.
>    4. Make another implementation that plugs in S3 and make it optionally
>    configurable.
>    5. Run tests to make sure we didn't break things with HDFS.
>    6. Run tests to see if S3 meets all the requirements.
>
> Ed Coleman also suggested first forking Accumulo 2.1 and merging the S3 
> changes into it.
>
> Chris and I look forward to the discussion on how best to add S3 support to 
> Accumulo.
>
> Thanks,
> Jeff
> --
> Jeff Kubina
