Re: FileSystem API (was: Slack call notes)

Christopher Fri, 24 Apr 2020 19:32:14 -0700

I'm not familiar with it, but the website says it can replace HDFS.
There appears to be an "HDFS Gateway"
(https://github.com/minio/minio/blob/master/docs/gateway/hdfs.md) that
might be useful. At a glance, it looks like no abstraction is needed
in the Accumulo code for it... you just run the gateway and Accumulo
believes it is using HDFS, but it is really using MinIO instead.


There also might be a Hadoop FileSystem implementation for it to use
it directly without a Gateway, but I didn't have any luck with a quick
search for one.

In either case, there shouldn't need to be any changes to Accumulo itself.

If changes to Accumulo do become necessary (or desired), I'd be
interested in collaborating on that part. If it's just a matter of
trying it with the Gateway or existing Hadoop FileSystem
implementation, I'd also be interested in testing any step-by-step
HOWTO guides somebody might want to write as a blog post.
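
For illustration only, in case changes do turn out to be needed: a rough
sketch of the kind of storage abstraction discussed in the quoted thread
below, where calling code asks for a store by URI scheme and never names
a concrete backend. Every name in it (VolumeStore, InMemoryVolumeStore,
VolumeStores, the "mem" scheme) is made up for this example and is not
Accumulo, Hadoop, or MinIO API; the in-memory backend only stands in for
a real one.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// All names here are hypothetical illustrations; none of them are
// Accumulo or Hadoop APIs.
interface VolumeStore {
  void write(String path, byte[] data);

  // Returns null if nothing has been written at the given path.
  byte[] read(String path);
}

// Trivial in-memory backend standing in for a real one.
final class InMemoryVolumeStore implements VolumeStore {
  private final Map<String, byte[]> files = new ConcurrentHashMap<>();

  @Override
  public void write(String path, byte[] data) {
    files.put(path, data.clone());
  }

  @Override
  public byte[] read(String path) {
    byte[] data = files.get(path);
    return data == null ? null : data.clone();
  }
}

// Registry that picks a backend from the URI scheme, so callers never
// reference a concrete implementation class.
final class VolumeStores {
  private static final Map<String, Supplier<VolumeStore>> BY_SCHEME =
      new ConcurrentHashMap<>();

  private VolumeStores() {}

  static void register(String scheme, Supplier<VolumeStore> factory) {
    BY_SCHEME.put(scheme, factory);
  }

  static VolumeStore forUri(URI uri) {
    String scheme = uri.getScheme();
    Supplier<VolumeStore> factory =
        scheme == null ? null : BY_SCHEME.get(scheme);
    if (factory == null) {
      throw new IllegalArgumentException("No store registered for " + uri);
    }
    return factory.get();
  }
}
```

With something along these lines, a Hadoop-backed implementation could
live in its own jar and register itself for "hdfs", while another
backend registers under a different scheme. Whether that is worth doing
depends on whether the Gateway approach above works without it.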

On Fri, Apr 24, 2020 at 11:20 AM Mike Miller <[email protected]> wrote:
>
> I have no experience with MinIO but would be interested in learning more
> and collaborating.
>
> On Fri, Apr 24, 2020 at 10:57 AM Michael Wall <[email protected]> wrote:
>
> > Resurrecting this thread on the File System API.  I have been thinking
> > about giving Minio [1] a try for both WALs and RFiles.  Seems to me like
> > step one is to abstract internal interfaces for both targeted against 2.1?
> > Couple of questions
> >
> > 1 - Anyone have experience with minio?
> > 2 - Anyone interested in collaborating?  Thinking anything from providing
> > input to helping to test once we get a prototype to actually doing some
> > development.
> >
> > Thanks, hope everyone is staying safe and healthy.
> >
> > [1] - https://min.io/
> >
> > On Wed, Mar 25, 2020 at 6:08 PM Christopher <[email protected]> wrote:
> >
> > > Only 705 across 280 files, if you exclude Text, though :)
> > >
> > > grep -rP 'org[.]apache[.]hadoop(?![.]io[.]Text)' --include='*.java' * \
> > >   | grep -v test/ | wc -l
> > >
> > > On Wed, Mar 25, 2020 at 3:34 PM Mike Miller <[email protected]> wrote:
> > > >
> > > > I think we have come a long way removing any external types from the
> > > > API, for reasons other than de-coupling from Hadoop.  While we don't
> > > > have many dependencies on the other components of Hadoop, we are still
> > > > very tightly coupled to HDFS.
> > > > For example, some quick grep'ing of the code shows:
> > > > grep -r "import org.apache.hadoop" --include=*.java * | wc -l
> > > > 1734
> > > > Without tests it is slightly more feasible...
> > > > grep -r "import org.apache.hadoop" --include=*.java * | grep -v "test" | wc -l
> > > > 858
> > > >
> > > >
> > > > On Wed, Mar 25, 2020 at 3:19 PM David Mollitor <[email protected]>
> > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I too have been thinking about this for a pet project.  There is
> > > > > already Apache Commons VFS that, with some investment, could probably
> > > > > serve all these requirements.
> > > > >
> > > > > On Wed, Mar 25, 2020, 3:16 PM Christopher <[email protected]>
> > wrote:
> > > > >
> > > > > > (Forking this thread, as it's a distinct topic)
> > > > > >
> > > > > > I've thought about it. The idea has driven me to try to reduce our
> > > > > > use of Hadoop-specific code, and to isolate Hadoop-specific stuff
> > > > > > behind some abstraction, wherever possible. Though, I'll admit,
> > > > > > we're nowhere close to where we'd want to be to be fully decoupled
> > > > > > from Hadoop.
> > > > > >
> > > > > > I've also been looking a lot at our VolumeManager code lately, to
> > > > > > try to improve it a bit, and to create better abstractions for
> > > > > > Volumes, that could aid future work in this area.
> > > > > >
> > > > > > But, I haven't directly been working on new FileSystem API
> > > > > > abstraction... just trying to lay some groundwork for that
> > > > > > possibility in future.
> > > > > >
> > > > > > It'd be nice to get to a point where we have a Hadoop-specific
> > > > > > implementation isolated to a jar that can be swapped out at runtime
> > > > > > for other file system implementations, as needed. I see that as a
> > > > > > somewhat long way off.
> > > > > >
> > > > > > On Wed, Mar 25, 2020 at 2:08 PM <[email protected]> wrote:
> > > > > > >
> > > > > > >
> > > > > > > > I couldn't make the call today, but am curious if anyone has
> > > > > > > > previously brought up creating a FileSystem API for Accumulo
> > > > > > > > so that we could use implementations other than Hadoop. I
> > > > > > > > realize that Hadoop provides implementations for things other
> > > > > > > > than HDFS but that doesn't necessarily mean that all
> > > > > > > > filesystem implementations are covered.
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Christopher <[email protected]>
> > > > > > > Sent: Wednesday, March 25, 2020 1:45 PM
> > > > > > > To: accumulo-dev <[email protected]>
> > > > > > > Subject: Slack call notes
> > > > > > >
> > > > > > > > Several committers/contributors in the community joined a call
> > > > > > > > in Slack on Wednesday, at 1130-1230, New York (Eastern) time.
> > > > > > > > Here are my notes of the call. Please feel free to add to them.
> > > > > > > >
> > > > > > > > I shared the overall philosophy and backstory to some of the
> > > > > > > > script improvements in 2.x to help guide current/future work
> > > > > > > > on the scripts.
> > > > > > > > * bin/accumulo is inspired by old jpackage.org standards which
> > > > > > > >   are still in use in RPM macros for Java packaging in
> > > > > > > >   Fedora/RHEL/etc. The key idea is that scripts are simple...
> > > > > > > >   set up environment (class path, etc.), locate java, and exec
> > > > > > > >   a single process with the provided args.
> > > > > > > > * bin/accumulo-service is inspired by old SysVInit scripts for
> > > > > > > >   start/stop/restart/status of a single service
> > > > > > > > * behavior of bin/accumulo and bin/accumulo-service can be
> > > > > > > >   manipulated through launch environment
> > > > > > > > * bin/accumulo-cluster uses bin/accumulo-service, and is
> > > > > > > >   provided as a simple, out-of-the-box cluster management tool
> > > > > > > > * bin/accumulo-cluster and bin/accumulo-service are
> > > > > > > >   replaceable; they are useful for out-of-the-box, but one
> > > > > > > >   would expect them to be unnecessary if using systemd, or a
> > > > > > > >   vendor-provided cluster management system
> > > > > > > > * we discussed possibly moving bin/accumulo-cluster and
> > > > > > > >   bin/accumulo-service to contrib/ in the tarball, or some
> > > > > > > >   subdir of bin/, but it was suggested to not make too many
> > > > > > > >   disruptive changes there
> > > > > > > > * we discussed the possibility of adding a config file for
> > > > > > > >   bin/accumulo-cluster (also mentioned on
> > > > > > > >   https://github.com/apache/accumulo/pull/1568)
> > > > > > > > * we discussed the need to document the intent/purpose/scope
> > > > > > > >   of the scripts in comments inside the scripts themselves
> > > > > > > > * Ed Coleman asked if it'd be good to document a systemd
> > > > > > > >   example; I suggested it might make for a good blog post
> > > > > > > >   (perhaps by the person who wrote the systemd unit files for
> > > > > > > >   Fluo Muchos)
> > > > > > >
> > > > > > > > Keith Turner discussed his development efforts with regard to
> > > > > > > > enabling more controls over compactions.
> > > > > > > >
> > > > > > > > * one main idea was to keep configuration/API for data
> > > > > > > >   separate from that for execution
> > > > > > > > * data is of concern to application owners, whereas execution
> > > > > > > >   involves system admins (resource contention, etc.)
> > > > > > > > * he will submit a PR for review when ready
> > > > > > > > * he also suggested another call to go over the PR
> > > > > > >
> > > > > > > > Billie Rinaldi discussed better support for Azure Data Lake
> > > > > > > > Storage Gen2 (ADLSv2).
> > > > > > > >
> > > > > > > > * maintaining a fork for experimenting, and working on
> > > > > > > >   reliably testing issues involving WALs
> > > > > > > > * did not recommend using ADLSv2 with WALs, but that we should
> > > > > > > >   still support it
> > > > > > > > * might need to implement a custom log closer to better
> > > > > > > >   support it
> > > > > > >
> > > > > > > > Mike Miller brought up the idea of eliminating more static
> > > > > > > > internal state.
> > > > > > > >
> > > > > > > > * ServerConfigurationFactory might be improved in this regard,
> > > > > > > >   with some additional ZK cleanup
> > > > > > > > * Other ZK cleanup might help elsewhere (such as ZooCache)
> > > > > > > > * I suggested tablet location cache might also benefit from
> > > > > > > >   being bound to an AccumuloClient lifecycle (or a dedicated
> > > > > > > >   opaque object that could be shared across AccumuloClient
> > > > > > > >   instances with its own user-managed lifecycle)
> > > > > > > >
> > > > > > > > Please add anything I might have missed (or got wrong) in
> > > > > > > > response to this post.
> > > > > > >
> > > > > >
> > > > >
> > >
> >
