Re: FileSystem API (was: Slack call notes)

Mike Miller Wed, 25 Mar 2020 12:34:18 -0700

I think we have come a long way removing any external types from the API,
for reasons other than de-coupling from Hadoop.  While we don't have many
dependencies on the other components of Hadoop, we are still very tightly
coupled to HDFS.
For example, some quick grep'ing of the code shows:
grep -r "import org.apache.hadoop" --include=*.java * | wc -l
1734
Excluding tests, the number is somewhat more manageable:
grep -r "import org.apache.hadoop" --include=*.java * | grep -v "test" | wc -l
858
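
To see where those imports are concentrated, the same grep can be grouped by
Hadoop package. This is only a rough sketch: the function name
hadoop_imports_by_package is made up for illustration, and it assumes test
sources live under a src/test/ directory (a narrower filter than matching
"test" anywhere in the line). I have not run it against the Accumulo tree, so
no counts are given here.

```shell
#!/bin/sh
# Sketch: count Hadoop imports per top-level Hadoop package under a source tree.
# Assumptions (not from the thread): the function name, and that test code
# lives under a "/src/test/" path component.
hadoop_imports_by_package() {
  dir="${1:-.}"
  # grep -r prints "path:matching line" for every Hadoop import in *.java files
  grep -rE --include='*.java' '^import (static )?org\.apache\.hadoop\.' "$dir" \
    | grep -v '/src/test/' \
    | sed -E 's/^[^:]*:import (static )?(org\.apache\.hadoop\.[a-z0-9_]+).*/\2/' \
    | sort | uniq -c | sort -rn
}
```

Running "hadoop_imports_by_package ." from the top of a checkout prints one
line per package (for example org.apache.hadoop.fs or org.apache.hadoop.conf),
with the most heavily imported packages first.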



On Wed, Mar 25, 2020 at 3:19 PM David Mollitor <[email protected]> wrote:

> Hello,
>
> I too have been thinking about this for a pet project.  There is already
> Apache Commons VFS that, with some investment, could probably serve all
> these requirements.
>
> On Wed, Mar 25, 2020, 3:16 PM Christopher <[email protected]> wrote:
>
> > (Forking this thread, as it's a distinct topic)
> >
> > I've thought about it. The idea has driven me to try to reduce our use
> > of Hadoop-specific code, and to isolate Hadoop-specific stuff behind
> > some abstraction, wherever possible. Though, I'll admit, we're nowhere
> > close to where we'd want to be to be fully decoupled from Hadoop.
> >
> > I've also been looking a lot at our VolumeManager code lately, to try
> > to improve it a bit, and to create better abstractions for Volumes,
> > that could aid future work in this area.
> >
> > But, I haven't directly been working on new FileSystem API
> > abstraction... just trying to lay some groundwork for that possibility
> > in future.
> >
> > It'd be nice to get to a point where we have a Hadoop-specific
> > implementation isolated to a jar that can be swapped out at runtime
> > for other file system implementations, as needed. I see that as
> > still a long way off.
> >
> > On Wed, Mar 25, 2020 at 2:08 PM <[email protected]> wrote:
> > >
> > >
> > >   I couldn't make the call today, but am curious if anyone has
> > previously brought up creating a FileSystem API for Accumulo so that we
> > could use implementations other than Hadoop. I realize that Hadoop provides
> > implementations for things other than HDFS but that doesn't necessarily
> > mean that all filesystem implementations are covered.
> > >
> > > -----Original Message-----
> > > From: Christopher <[email protected]>
> > > Sent: Wednesday, March 25, 2020 1:45 PM
> > > To: accumulo-dev <[email protected]>
> > > Subject: Slack call notes
> > >
> > > Several committers/contributors in the community joined a call in Slack
> > on Wednesday, at 1130-1230, New York (Eastern) time. Here are my notes of
> > the call. Please feel free to add to them.
> > >
> > > I shared the overall philosophy and backstory to some of the script
> > improvements in 2.x to help guide current/future work on the scripts.
> > >
> > > * bin/accumulo is inspired by old jpackage.org standards which are
> > still in use in RPM macros for Java packaging in Fedora/RHEL/etc. The key
> > idea is that scripts are simple... set up environment (class path, etc.),
> > locate java, and exec a single process with the provided args.
> > > * bin/accumulo-service is inspired by old SysVInit scripts for
> > start/stop/restart/status of a single service
> > > * behavior of bin/accumulo and bin/accumulo-service can be manipulated
> > through launch environment
> > > * bin/accumulo-cluster uses bin/accumulo-service, and is provided as a
> > simple, out-of-the-box cluster management tool
> > > * bin/accumulo-cluster and bin/accumulo-service are replaceable; they
> > are useful for out-of-the-box, but one would expect them to be unnecessary
> > if using systemd, or a vendor-provided cluster management system
> > > * we discussed possibly moving bin/accumulo-cluster and
> > bin/accumulo-service to contrib/ in the tarball, or some subdir of bin/,
> > but it was suggested to not make too many disruptive changes there
> > > * we discussed the possibility of adding a config file for
> > bin/accumulo-cluster (also mentioned on
> > > https://github.com/apache/accumulo/pull/1568)
> > > * we discussed the need to document the intent/purpose/scope of the
> > scripts in comments inside the scripts themselves
> > > * Ed Coleman asked if it'd be good to document a systemd example; I
> > suggested it might make for a good blog post (perhaps by the person who
> > wrote the systemd unit files for Fluo Muchos)
> > >
> > > Keith Turner discussed his development efforts with regard to enabling
> > more controls over compactions.
> > >
> > > * one main idea was to keep configuration/API for data separate from
> > that for execution
> > > > * data is of concern to application owners, whereas execution involves
> > system admins (resource contention, etc.)
> > > * he will submit a PR for review when ready
> > > * he also suggested another call to go over the PR
> > >
> > > Billie Rinaldi discussed better support for Azure Data Lake Storage
> > > Gen2 (ADLSv2).
> > >
> > > * maintaining a fork for experimenting, and working on reliably testing
> > issues involving WALs
> > > * did not recommend using ADLSv2 with WALs, but that we should still
> > support it
> > > * might need to implement a custom log closer to better support it
> > >
> > > Mike Miller brought up the idea of eliminating more static internal
> > state.
> > >
> > > > * ServerConfigurationFactory might be improved in this regard, with some
> > additional ZK cleanup
> > > * Other ZK cleanup might help elsewhere (such as ZooCache)
> > > > * I suggested tablet location cache might also benefit from being bound
> > to an AccumuloClient lifecycle (or a dedicated opaque object that could be
> > shared across AccumuloClient instances with its own user-managed lifecycle)
> > >
> > > Please add anything I might have missed (or got wrong) in response to
> > this post.
> > >
> >
>
