As a former Pentaho employee, I can add some details around this:

- Pentaho does/did have a fork of Apache VFS. Mostly it was the application of bug fixes from the 2.x line to the 1.x codebase. Since then they have contributed fixes from their branch back upstream to the 2.x line, and when I left they were hoping an upcoming release of Apache VFS would include their fixes (so they didn't have to depend on a local or SNAPSHOT build). IIRC the Apache VFS project was not under (very) active development at the time; perhaps with the fixes for Accumulo this has changed.
- Pentaho's HDFS and MapRFS providers are not part of the VFS fork/project, nor do I (or did we) believe they should be. The dependencies for these providers are pretty big and specific to the Hadoop ecosystem, so having them as separate drop-in modules is a better idea IMO. Also, the Hadoop JARs are "provided" dependencies of these modules (so Hadoop itself is not bundled); this is so the same VFS provider can (hopefully) be used against different versions of Hadoop (see the short sketch at the end of this message). I guess in that sense they could be added to Apache VFS proper, but then it's on the VFS project to handle the other issues (see below).

- Including the HDFS or MapRFS provider along with Apache VFS is not enough to make it work. Pentaho had a classloading configuration where they would bundle all the necessary code, JARs, configs, etc. into what they call a "shim". They have shims for multiple versions of multiple distributions of Hadoop (HDP, CDH, MapR, etc.). To support this in NiFi, we would need some mechanism to get the right JARs from the right places. If NiFi is on a cluster node, then the correct Hadoop libraries are already available; in any case, it probably would only work as a "Bring Your Own Hadoop". Currently NiFi includes a vanilla version of the Hadoop client libraries so that it can live outside the Hadoop cluster and still use the client libraries/protocols to move data into Hadoop. With the advent of the Extension Registry, the Bring Your Own Hadoop solution becomes more viable.

- Even with a Bring Your Own Hadoop solution, we'd need to isolate that functionality behind a service in order to allow multiple Hadoop vendors, versions, etc. That would allow such things as a migration from one vendor's Hadoop to another. The alternatives include WebHDFS and HttpFS as you mentioned (if they are enabled on the system), and possibly Knox and/or Falcon depending on your use case. As of now, Pentaho still only supports one vendor/version of Hadoop at a time, though this has been logged for improvement (http://jira.pentaho.com/browse/PDI-8121).

- MapR has a somewhat different architecture than many other vendors. For example, they have a native client that the user must install on each node that needs to communicate with Hadoop. The JARs are installed in a specific location, so NiFi would need a way for the user to point to that location. This is hopefully just a configuration issue (i.e. where to point the Bring Your Own Hadoop solution), but I believe they also have a different security model, so I imagine there will be some MapR-specific processing to be done around Kerberos, for example.

There is a NiFi Jira that covers multiple Hadoop versions (https://issues.apache.org/jira/browse/NIFI-710); hopefully good discussion like this will inform that case (and end up in the comments).

Regards,
Matt

P.S. Mr. Rosander :) If you're reading this, I'd love to get your comments on this thread as well, thanks!
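To make the drop-in model above a bit more concrete, here is a rough, untested sketch of what resolving an HDFS path through Commons VFS looks like once a provider JAR such as pentaho-hdfs-vfs (which registers its schemes via a META-INF/vfs-providers.xml descriptor) and a matching set of Hadoop client JARs are on the classpath. The namenode host/port and path below are placeholders, not anything taken from Pentaho or NiFi:

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemException;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class HdfsVfsSketch {
    public static void main(String[] args) throws FileSystemException {
        // VFS.getManager() builds a StandardFileSystemManager, which loads any
        // META-INF/vfs-providers.xml descriptors it finds on the classpath.
        // A drop-in provider JAR registers the "hdfs"/"maprfs" schemes that way,
        // while the Hadoop client JARs themselves come from whatever
        // "Bring Your Own Hadoop" classpath the deployment supplies
        // (they are "provided" scope in the provider module).
        FileSystemManager fsManager = VFS.getManager();

        // Placeholder namenode host/port and path -- adjust for the target cluster.
        FileObject dir = fsManager.resolveFile("hdfs://namenode.example.com:8020/user/nifi");

        if (dir.exists()) {
            for (FileObject child : dir.getChildren()) {
                // Calling code only sees the VFS FileObject API, so the same code
                // could target hdfs://, maprfs://, file://, etc.
                System.out.println(child.getName().getBaseName());
            }
        }
        dir.close();
    }
}

The point being that this code never touches Hadoop classes directly; which cluster (or vendor) it talks to becomes a classpath/configuration concern, which is exactly what the service isolation described above would need to manage.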
On Sun, May 29, 2016 at 10:41 AM, Jim Hughes <[email protected]> wrote:
> Hi Andre,
>
> Your plan seems reasonable to me. The shortest path to verifying it might
> be to drop in the pentaho-hdfs-vfs artifacts and remove any conflicting
> VFS providers (or just their config (1)) from the Apache VFS jars.
>
> Some of the recent effort in VFS has been to address bugs which were
> relevant to Apache Accumulo. From hearing about that, it sounds like VFS
> may have a somewhat smaller team.
>
> That said, it might be worth asking the Pentaho folks 1) if they could
> contribute their project to VFS and 2) how they leverage it. They might
> have some guidance about how to use their project as a replacement for
> the HDFS parts of VFS.
>
> Good luck!
>
> Jim
>
> 1. I'd peek at files like this:
> https://github.com/pentaho/pentaho-hdfs-vfs/blob/master/res/META-INF/vfs-providers.xml
>
> On 5/29/2016 8:10 AM, Andre wrote:
>>
>> All,
>>
>> Not sure how many other MapR users are effectively using NiFi (I only
>> know two others), but as you may remember from old threads, integrating
>> some different flavours of HDFS-compatible APIs can sometimes be
>> puzzling and require recompilation of bundles.
>>
>> However, recompilation doesn't solve scenarios where, for whatever
>> reason, a user may want to use more than one HDFS provider (e.g. MapR +
>> HDP, or Isilon + MapR) and the HDFS versions are distinct (e.g.
>>
>> While WebHDFS and HttpFs are good palliative solutions to some of this
>> issue, they have their own limitations, the most striking being the
>> need to create Kerberos proxy users to run those services [1] and
>> potential bottlenecks [2].
>>
>> I was wondering if we could tap into the work Pentaho did around using
>> a fork of Apache VFS as an option to solve this issue and also to unify
>> the .*MapR and .*HDFS processors. [*]
>>
>> Pentaho's code is Apache Licensed and is available here:
>>
>> https://github.com/pentaho/pentaho-hdfs-vfs/blob/master/src/org/pentaho/hdfs/vfs/
>>
>> As you can see, VFS acts as a middleman between the application and the
>> API being used to access the "HDFS" backend. I have used Pentaho before
>> and know that this functionality works reasonably well.
>>
>> Any thoughts?
>>
>> [1] required if file ownership does not equal the user running the API
>> endpoint
>> [2] HttpFs
>> [*] Ideally VFS upstream could be offered a PR to address this, but I'm
>> not sure how feasible it is to achieve this.
>
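For reference, the vfs-providers.xml file Jim points to in (1) follows the standard Commons VFS plugin-descriptor format; a registration of the hdfs/maprfs schemes would look roughly like the snippet below. The class names are only my recollection of the pentaho-hdfs-vfs sources, so check them against the linked file:

<!-- Shipped inside the provider JAR as META-INF/vfs-providers.xml;
     StandardFileSystemManager discovers it on the classpath.
     Class names are assumptions; verify against the Pentaho repo. -->
<providers>
    <provider class-name="org.pentaho.hdfs.vfs.HDFSFileProvider">
        <scheme name="hdfs"/>
    </provider>
    <provider class-name="org.pentaho.hdfs.vfs.MapRFSFileProvider">
        <scheme name="maprfs"/>
    </provider>
</providers>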
