As a former Pentaho employee, I can add some details around this:

- Pentaho does/did have a fork of Apache VFS. Mostly it was the application of bug fixes from the 2.x line to the 1.x codebase. Since then they have contributed fixes from their branch back upstream to the 2.x line, and when I left they were hoping an upcoming release of Apache VFS would include their fixes (so they didn't have to depend on a local or SNAPSHOT build). IIRC the Apache VFS project was not under (very) active development at the time; perhaps with the fixes for Accumulo this has changed.
- Pentaho's HDFS and MapRFS providers are not part of the VFS fork/project, nor do I (or did we) believe they should be. The dependencies for these providers are pretty big and specific to the Hadoop ecosystem, so having them as separate drop-in modules is a better idea IMO. Also, the Hadoop JARs are "provided" dependencies of these modules (so Hadoop itself is not bundled); this is so the same VFS provider can (hopefully) be used against different versions of Hadoop (see the short sketch at the end of this message). I guess in that sense they could be added to Apache VFS proper, but then it's on the VFS project to handle the other issues (see below).

- Including the HDFS or MapRFS provider along with Apache VFS is not enough to make it work. Pentaho had a classloading configuration where they would bundle all the necessary code, JARs, configs, etc. into what they call a "shim". They have shims for multiple versions of multiple distributions of Hadoop (HDP, CDH, MapR, etc.). To support this in NiFi, we would need some mechanism to get the right JARs from the right places. If NiFi is on a cluster node, then the correct Hadoop libraries are already available; in any case, it probably would only work as a "Bring Your Own Hadoop". Currently NiFi includes a vanilla version of the Hadoop client libraries so that it can live outside the Hadoop cluster and still use the client libraries/protocols to move data into Hadoop. With the advent of the Extension Registry, the Bring Your Own Hadoop solution becomes more viable.

- Even with a Bring Your Own Hadoop solution, we'd need to isolate that functionality behind a service in order to allow multiple Hadoop vendors, versions, etc. That would allow such things as a migration from one vendor's Hadoop to another. The alternatives include WebHDFS and HttpFS as you mentioned (if they are enabled on the system), and possibly Knox and/or Falcon depending on your use case. As of now, Pentaho still only supports one vendor/version of Hadoop at a time, though this has been logged for improvement (http://jira.pentaho.com/browse/PDI-8121).

- MapR has a somewhat different architecture than many other vendors. For example, they have a native client that the user must install on each node that needs to communicate with Hadoop. The JARs are installed in a specific location, so NiFi would need a way for the user to point to that location. This is hopefully just a configuration issue (i.e. where to point the Bring Your Own Hadoop solution), but I believe they also have a different security model, so I imagine there will be some MapR-specific processing to be done around Kerberos, for example.

There is a NiFi Jira that covers multiple Hadoop versions (https://issues.apache.org/jira/browse/NIFI-710); hopefully good discussion like this will inform that case (and end up in the comments).

Regards,
Matt

P.S. Mr. Rosander :) If you're reading this, I'd love to get your comments on this thread as well, thanks!
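To make the drop-in model above a bit more concrete, here is a rough, untested sketch of what resolving an HDFS path through Commons VFS looks like once a provider JAR such as pentaho-hdfs-vfs (which registers its schemes via a META-INF/vfs-providers.xml descriptor) and a matching set of Hadoop client JARs are on the classpath. The namenode host/port and path below are placeholders, not anything taken from Pentaho or NiFi:

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemException;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class HdfsVfsSketch {
    public static void main(String[] args) throws FileSystemException {
        // VFS.getManager() builds a StandardFileSystemManager, which loads any
        // META-INF/vfs-providers.xml descriptors it finds on the classpath.
        // A drop-in provider JAR registers the "hdfs"/"maprfs" schemes that way,
        // while the Hadoop client JARs themselves come from whatever
        // "Bring Your Own Hadoop" classpath the deployment supplies
        // (they are "provided" scope in the provider module).
        FileSystemManager fsManager = VFS.getManager();

        // Placeholder namenode host/port and path -- adjust for the target cluster.
        FileObject dir = fsManager.resolveFile("hdfs://namenode.example.com:8020/user/nifi");

        if (dir.exists()) {
            for (FileObject child : dir.getChildren()) {
                // Calling code only sees the VFS FileObject API, so the same code
                // could target hdfs://, maprfs://, file://, etc.
                System.out.println(child.getName().getBaseName());
            }
        }
        dir.close();
    }
}

The point being that this code never touches Hadoop classes directly; which cluster (or vendor) it talks to becomes a classpath/configuration concern, which is exactly what the service isolation described above would need to manage.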
On Sun, May 29, 2016 at 10:41 AM, Jim Hughes <[email protected]> wrote:
> Hi Andre,
>
> Your plan seems reasonable to me. The shortest path to verifying it might
> be to drop in the pentaho-hdfs-vfs artifacts and remove any conflicting
> VFS providers (or just their config (1)) from the Apache VFS jars.
>
> Some of the recent effort in VFS has been to address bugs which were
> relevant to Apache Accumulo. From hearing about that, it sounds like VFS
> may have a somewhat smaller team.
>
> That said, it might be worth asking the Pentaho folks 1) if they could
> contribute their project to VFS and 2) how they leverage it. They might
> have some guidance about how to use their project as a replacement for
> the HDFS parts of VFS.
>
> Good luck!
>
> Jim
>
> 1. I'd peek at files like this:
> https://github.com/pentaho/pentaho-hdfs-vfs/blob/master/res/META-INF/vfs-providers.xml
>
> On 5/29/2016 8:10 AM, Andre wrote:
>>
>> All,
>>
>> Not sure how many other MapR users are effectively using NiFi (I only
>> know two others), but as you may remember from old threads, integrating
>> some different flavours of HDFS-compatible APIs can sometimes be
>> puzzling and require recompilation of bundles.
>>
>> However, recompilation doesn't solve scenarios where, for whatever
>> reason, a user may want to use more than one HDFS provider (e.g. MapR +
>> HDP, or Isilon + MapR) and the HDFS versions are distinct (e.g.
>>
>> While WebHDFS and HttpFs are good palliative solutions to some of this
>> issue, they have their own limitations, the most striking being the
>> need to create Kerberos proxy users to run those services [1] and
>> potential bottlenecks [2].
>>
>> I was wondering if we could tap into the work Pentaho did around using
>> a fork of Apache VFS as an option to solve this issue and also to unify
>> the .*MapR and .*HDFS processors. [*]
>>
>> Pentaho's code is Apache Licensed and is available here:
>>
>> https://github.com/pentaho/pentaho-hdfs-vfs/blob/master/src/org/pentaho/hdfs/vfs/
>>
>> As you can see, VFS acts as a middleman between the application and the
>> API being used to access the "HDFS" backend. I have used Pentaho before
>> and know that this functionality works reasonably well.
>>
>> Any thoughts?
>>
>> [1] required if file ownership does not equal the user running the API
>> endpoint
>> [2] HttpFs
>> [*] Ideally VFS upstream could be offered a PR to address this, but I'm
>> not sure how feasible it is to achieve this.
>
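For reference, the vfs-providers.xml file Jim points to in (1) follows the standard Commons VFS plugin-descriptor format; a registration of the hdfs/maprfs schemes would look roughly like the snippet below. The class names are only my recollection of the pentaho-hdfs-vfs sources, so check them against the linked file:

<!-- Shipped inside the provider JAR as META-INF/vfs-providers.xml;
     StandardFileSystemManager discovers it on the classpath.
     Class names are assumptions; verify against the Pentaho repo. -->
<providers>
    <provider class-name="org.pentaho.hdfs.vfs.HDFSFileProvider">
        <scheme name="hdfs"/>
    </provider>
    <provider class-name="org.pentaho.hdfs.vfs.MapRFSFileProvider">
        <scheme name="maprfs"/>
    </provider>
</providers>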
