Bryan, Matt,

Thanks for the input, much appreciated.
Perhaps the simplest option, then, may be to give the user the option to
compile against one particular vendor, while ensuring the project build points
to the Apache-licensed code. This way the user is given the following choices:

- Primary platform is open-source/HDP/CDH: the cluster can be accessed via
  native HDFS, or WebHdfs can be used to reach the 3rd-party
  implementations[*].

- Primary platform is MapR: build against the MapR JARs, which gives access
  via MapR-FS, and hopefully native HDFS still works as well (depending on the
  protocol version supported by the MapR JARs). If nothing else works, WebHdfs
  should come to the rescue[*].

- Primary platform is something else: use WebHdfs or, if possible, add another
  opt-in profile to the pom. I am not sure whether licensing would prevent
  such a profile from being included, but I have submitted an initial
  suggestion of what it might look like anyway.

What do you think?

[*] WebHdfs functionality looks promising (see NIFI-1924) but is yet to be
confirmed. A rough sketch of the webhdfs:// usage is appended at the bottom of
this mail.

On Tue, May 31, 2016 at 2:09 AM, Bryan Rosander <[email protected]> wrote:
> Hey all,
>
> The pentaho-hdfs-vfs artifact is in the process of being deprecated and
> superseded by vfs providers in their big data plugin.
>
> They wouldn't be sufficient for outside use either way, as they depend on
> the big data plugin, Kettle, and the correct shim (see Matt's comment) to
> work.
>
> A more generic provider with some way to swap dependencies and
> implementations could be a way forward.
>
> This could be a lot simpler than the above if it only cared about providing
> vfs access to hdfs/maprfs.
>
> Thanks,
> Bryan
>
> On May 30, 2016 10:56 AM, Matt Burgess <[email protected]> wrote:
> As a former Pentaho employee, I can add some details around this:
>
> - Pentaho does/did have a fork of Apache VFS. Mostly it was the application
> of fixes for bugs in the 2.x line against the 1.x codebase. Since then they
> have contributed fixes from their branch back to the 2.x line, and when I
> left they were hoping for an upcoming release of Apache VFS to include
> their fixes (so they didn't have to depend on a local or SNAPSHOT build).
> IIRC the Apache VFS project was not under (very) active development at the
> time. Perhaps with the fixes for Accumulo this has changed.
>
> - Pentaho's HDFS and MapRFS providers are not part of the VFS fork/project,
> nor do I (or did we) believe they should be. The dependencies for these
> providers are pretty big and specific to the Hadoop ecosystem. Having them
> as separate drop-in modules is a better idea IMO. Also, the Hadoop JARs are
> "provided" dependencies of these modules (so Hadoop is not included); this
> is so the same VFS provider can (hopefully) be used against different
> versions of Hadoop. I guess in that sense they could be added to Apache VFS
> proper, but then it's on the VFS project to handle the other issues (see
> below).
>
> - Including the HDFS or MapRFS provider along with Apache VFS is not enough
> to make it work. Pentaho had a classloading configuration where they would
> bundle all the necessary code, JARs, configs, etc. into what they call a
> "shim". They have shims for multiple versions of multiple distributions of
> Hadoop (HDP, CDH, MapR, etc.). To support this in NiFi, we would need some
> mechanism to get the right JARs from the right places. If NiFi is on a
> cluster node, then the correct Hadoop libraries are already available. In
> any case, it probably would only work as a "Bring Your Own Hadoop".
> Currently NiFi includes a vanilla version of Hadoop so that it can exist
> outside the Hadoop cluster and still use client libraries/protocols to move
> data into Hadoop. With the advent of the Extension Registry, the Bring Your
> Own Hadoop solution becomes more viable.
>
> - Even with a Bring Your Own Hadoop solution, we'd need to isolate that
> functionality behind a service for the purpose of allowing multiple Hadoop
> vendors, versions, etc. That would allow such things as a migration from
> one vendor's Hadoop to another. The alternatives include WebHDFS and HttpFS
> as you mentioned (if they are enabled on the system), and possibly Knox
> and/or Falcon depending on your use case. As of now, Pentaho still only
> supports one vendor/version of Hadoop at a time, though this has been
> logged for improvement (http://jira.pentaho.com/browse/PDI-8121).
>
> - MapR has a somewhat different architecture than many other vendors. For
> example, they have a native client that the user must install on each node
> intending to communicate with Hadoop. The JARs are installed in a specific
> location, and thus NiFi would need a way for the user to point to that
> location. This is hopefully just a configuration issue (i.e. where to point
> the Bring Your Own Hadoop solution), but I believe they also have a
> different security model, so I imagine there will be some MapR-specific
> processing to be done around Kerberos, e.g.
>
> There is a NiFi Jira that covers multiple Hadoop versions
> (https://issues.apache.org/jira/browse/NIFI-710); hopefully good discussion
> like this will inform that case (and end up in the comments).
>
> Regards,
> Matt
>
> P.S. Mr. Rosander :) If you're reading this, I'd love to get your comments
> on this thread as well, thanks!
>
> On Sun, May 29, 2016 at 10:41 AM, Jim Hughes <[email protected]> wrote:
> > Hi Andre,
> >
> > Your plan seems reasonable to me. The shortest path to verifying it might
> > be to drop in the pentaho-hdfs-vfs artifacts and remove any conflicting
> > VFS providers (or just their config (1)) from the Apache VFS jars.
> >
> > Some of the recent effort in VFS has been to address bugs which were
> > relevant to Apache Accumulo. From hearing about that, it sounds like VFS
> > may have a somewhat smaller team.
> >
> > That said, it might be worth asking the Pentaho folks 1) if they could
> > contribute their project to VFS and 2) how they leverage it. They might
> > have some guidance about how to use their project as a replacement for
> > the HDFS parts of VFS.
> >
> > Good luck!
> >
> > Jim
> >
> > 1. I'd peek at files like this:
> > https://github.com/pentaho/pentaho-hdfs-vfs/blob/master/res/META-INF/vfs-providers.xml
> >
> > On 5/29/2016 8:10 AM, Andre wrote:
> >> All,
> >>
> >> Not sure how many other MapR users are effectively using NiFi (I only
> >> know two others), but as you may remember from old threads, integrating
> >> some different flavours of HDFS-compatible APIs can sometimes be
> >> puzzling and require recompilation of bundles.
> >>
> >> However, recompilation doesn't solve scenarios where, for whatever
> >> reason, a user may want to use more than one HDFS provider (e.g. MapR +
> >> HDP, or Isilon + MapR) and the HDFS versions are distinct (e.g.
> >>
> >> While WebHDFS and HttpFs are good palliative solutions to some of these
> >> issues, they have their own limitations, the more striking ones being
> >> the need to create Kerberos proxy users to run those services [1] and
> >> potential bottlenecks [2].
> >>
> >> I was wondering if we could tap into the work Pentaho did around using
> >> a fork of Apache VFS as an option to solve this issue, and also to
> >> unify the .*MapR and .*HDFS processors.[*]
> >>
> >> Pentaho's code is Apache-licensed and is available here:
> >>
> >> https://github.com/pentaho/pentaho-hdfs-vfs/blob/master/src/org/pentaho/hdfs/vfs/
> >>
> >> As you can see, VFS acts as a middle man between the application and
> >> the API being used to access the "HDFS" backend. I used Pentaho before
> >> and know that this functionality happens to work reasonably well.
> >>
> >> Any thoughts?
> >>
> >> [1] required if file ownership does not equal the user running the API
> >> endpoint
> >> [2] HttpFs
> >> [*] Ideally VFS upstream could be offered a PR to address this, but I
> >> am not sure how feasible it is to achieve this.
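
For reference, here is a rough, untested sketch of the "VFS as a middle man"
usage described in my original mail above. It assumes commons-vfs2 (with an
HDFS provider registered for the hdfs scheme) and the Hadoop client JARs for
the target cluster are on the classpath; the class name, host, port and path
below are placeholders:

import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class VfsHdfsSketch {

    public static void main(String[] args) throws IOException {
        // VFS sits between the application and the concrete file system
        // client. Which provider answers for a given scheme ("hdfs",
        // "maprfs", ...) depends on what is registered, not on this code.
        FileSystemManager fsManager = VFS.getManager();

        // Swapping the URI scheme (and the provider/vendor JARs behind it)
        // is, ideally, all that changes when moving between back ends.
        FileObject file = fsManager.resolveFile(
                "hdfs://namenode.example.com:8020/tmp/sample.txt");

        if (file.exists()) {
            try (InputStream in = file.getContent().getInputStream()) {
                System.out.println("first byte: " + in.read());
            }
        }
        file.close();
    }
}

Whether the stock Commons VFS HDFS provider or the Pentaho one sits behind the
hdfs/maprfs schemes would then be a packaging decision rather than a code
change, which is the appeal of the approach to me.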

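And a similarly rough sketch of the WebHdfs fallback flagged in [*] above,
using the plain org.apache.hadoop.fs.FileSystem API with a webhdfs:// URI.
Host, port and path are placeholders, WebHDFS has to be enabled on the
cluster, and with Kerberos some proxy-user configuration may be needed, as
discussed earlier in the thread:

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WebHdfsSketch {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Same FileSystem API as hdfs://, but the webhdfs:// scheme talks
        // HTTP to the NameNode, so no vendor-specific client JARs are
        // needed on the NiFi side.
        FileSystem fs = FileSystem.get(
                URI.create("webhdfs://namenode.example.com:50070"), conf);

        try (InputStream in = fs.open(new Path("/tmp/sample.txt"))) {
            System.out.println("first byte: " + in.read());
        } finally {
            fs.close();
        }
    }
}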