Andre,

I definitely believe it is worth documenting this capability as a path to
storage providers that have no NiFi processors.  But I'm not sold on
dropping the processors we have now.  In addition to the great points made
by Andrew and Matt:

*Usability* - Specific storage processors provide an intuitive path for
the user thinking "I want to write this to S3".  Being specific to a
storage provider allows the processors to mirror that provider's
terminology and features.  It would be difficult to smoothly signal to
users that they should use PutHDFS and then torture its configuration
files until it writes to S3.

*Advertising* - Having a broad array of storage processors gives NiFi
tangible, linkable, and Googleable answers to what it supports.

*Positioning* - Stripping out other stores in favor of an
HDFS-library-first design would position NiFi closer to Hadoop/HDFS and
make it look less like an independent mediator, if only in some small way.

I also believe that the NiFi Registry initiative should help address the
processor explosion.


Thanks,

James

On Tue, Feb 21, 2017 at 3:45 PM, Matt Burgess <mattyb...@apache.org> wrote:

> I agree with Andrew in the operations sense, and would like to add
> that the user experience around dynamic properties (and even
> "conditional" properties that are not dynamic but can be exposed when
> other properties are "Applied") can be less-than-ideal and IMHO should
> be used sparingly. Full disclosure: My latest processor uses
> "conditional" properties at the moment, choosing them over dynamic
> properties in the hopes that the user experience is better, but
> without in-place updates (possibly implemented under [1]) and/or the
> UI making it obvious that dynamic properties are supported (under
> [2]), I'm not sure which is better (or if I should create different
> processors for my case as well).
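>
> For reference, opting into dynamic properties is just a matter of
> overriding getSupportedDynamicPropertyDescriptor(), roughly like this
> (a minimal sketch with a made-up processor name and an arbitrary
> validator, not my actual code):
>
>     import org.apache.nifi.annotation.behavior.DynamicProperty;
>     import org.apache.nifi.components.PropertyDescriptor;
>     import org.apache.nifi.processor.AbstractProcessor;
>     import org.apache.nifi.processor.ProcessContext;
>     import org.apache.nifi.processor.ProcessSession;
>     import org.apache.nifi.processor.exception.ProcessException;
>     import org.apache.nifi.processor.util.StandardValidators;
>
>     @DynamicProperty(name = "A provider-specific setting", value = "Its value",
>             description = "Passed through to the underlying client configuration")
>     public class ExampleStorageProcessor extends AbstractProcessor {
>
>         @Override
>         protected PropertyDescriptor getSupportedDynamicPropertyDescriptor(final String name) {
>             // Any property name not declared statically is accepted as a
>             // user-defined dynamic property.
>             return new PropertyDescriptor.Builder()
>                     .name(name)
>                     .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
>                     .dynamic(true)
>                     .build();
>         }
>
>         @Override
>         public void onTrigger(final ProcessContext context, final ProcessSession session)
>                 throws ProcessException {
>             // Real work omitted; this sketch only shows the dynamic-property hook.
>         }
>     }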
>
> Under the hood, if it makes sense to group these processors and
> abstract away common code, then I'm all for it.  Especially if we can
> use something like the nifi-hadoop-libraries-nar as an ancestor NAR to
> provide a common set of libraries to all the Hadoop-Compatible File
> System (HCFS) implementations.  However, I fear that, depending on the
> versions of the specific HCFS implementations, they may also need
> different versions of the HCFS client dependencies, in which case we'd
> be looking to the Extension Registry and some smart classloading to
> alleviate those pain points without ballooning the NiFi footprint.
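>
> (If it helps to picture it: a parent NAR is just a Maven dependency of
> type "nar" in the child NAR's pom, something like the snippet below.
> That is the same pattern the existing HDFS bundle uses today, if I
> recall correctly.)
>
>     <!-- In the storage bundle's -nar pom.xml: inherit the Hadoop client
>          libraries from nifi-hadoop-libraries-nar instead of re-bundling them. -->
>     <dependency>
>         <groupId>org.apache.nifi</groupId>
>         <artifactId>nifi-hadoop-libraries-nar</artifactId>
>         <type>nar</type>
>     </dependency>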
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-1121
> [2] https://issues.apache.org/jira/browse/NIFI-2629
>
>
> On Tue, Feb 21, 2017 at 6:21 PM, Andrew Grande <apere...@gmail.com> wrote:
> > Andre,
> >
> > I came across multiple NiFi use cases where going through the HDFS layer
> > and the fs plugin may not be possible, i.e. when no HDFS layer is present
> > at all, so there is no NameNode to connect to.
> >
> > Another important aspect is operations. The current PutHDFS model with an
> > additional jar location, well, it kinda works, but I very much dislike it.
> > Too many possibilities for human error, in addition to deployment pain,
> > especially in a cluster.
> >
> > Finally, native object storage processors have features which may not
> > even apply to the HDFS layer, e.g. Azure Storage has Table storage, etc.
> >
> > I agree consolidating various efforts is worthwhile, but only within the
> > context of a specific storage solution, not by 'unifying' them into a
> > single layer.
> >
> > Andrew
> >
> > On Tue, Feb 21, 2017, 6:10 PM Andre <andre-li...@fucs.org> wrote:
> >
> >> dev,
> >>
> >> I was having a chat with Pierre around PR#379 and we thought it would be
> >> worth sharing this with the wider group:
> >>
> >>
> >> I recently noticed that a number of PRs around scale-out / cloud-based
> >> object stores were merged into master.
> >>
> >> Would it make sense to consider adopting a pattern where
> >> Put/Get/ListHDFS are used in tandem with implementations of the
> >> hadoop.filesystem interfaces instead of creating new processors, except
> >> where a particular deficiency/incompatibility in the hadoop.filesystem
> >> implementation exists?
> >>
> >> Candidates for removal / non merge would be:
> >>
> >> - Alluxio (PR#379)
> >> - WASB (PR#626)
> >> - Azure* (PR#399)
> >> - *GCP (recently merged as PR#1482)
> >> - *S3 (although this has been in code so it would have to be deprecated)
> >>
> >> The pattern would be pretty much the same as the one documented and
> >> successfully deployed here:
> >>
> >> https://community.hortonworks.com/articles/71916/connecting-to-azure-data-lake-from-a-nifi-dataflow.html
> >>
> >> Which means that in the case of Alluxio, one would use the properties
> >> documented here:
> >>
> >> https://www.alluxio.com/docs/community/1.3/en/Running-Hadoop-MapReduce-on-Alluxio.html
> >>
> >> While with Google Cloud Storage we would use the properties documented
> >> here:
> >>
> >> https://cloud.google.com/hadoop/google-cloud-storage-connector
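> >>
> >> For illustration, this is the kind of core-site.xml the processor's
> >> "Hadoop Configuration Resources" property would point at (property
> >> names taken from the docs above; just a sketch, untested), with the
> >> connector jar supplied via "Additional Classpath Resources":
> >>
> >>     <configuration>
> >>       <!-- register Alluxio as a Hadoop-compatible filesystem -->
> >>       <property>
> >>         <name>fs.alluxio.impl</name>
> >>         <value>alluxio.hadoop.FileSystem</value>
> >>       </property>
> >>       <!-- or Google Cloud Storage via its Hadoop connector -->
> >>       <property>
> >>         <name>fs.gs.impl</name>
> >>         <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
> >>       </property>
> >>     </configuration>
> >>
> >> The Directory property would then simply use the matching scheme, e.g.
> >> alluxio://master:19998/some/path or gs://bucket/some/path.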
> >>
> >> I noticed that the specific processors can expose properties that are
> >> particular to a filesystem; however, I would like to believe the same
> >> need applies to Hadoop users, and it is therefore reasonable to expect
> >> that the Hadoop-compatible implementations have ways of exposing those
> >> properties as well?
> >>
> >> In case those properties are exposed, we could perhaps simply adjust the
> >> *HDFS processors to use dynamic properties to pass them down to the
> >> underlying module, therefore providing a way to expose particular
> >> settings of the underlying storage platform.
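> >>
> >> Something along these lines in the common Hadoop processor base class
> >> would probably be enough (a hypothetical helper, named here just to
> >> sketch the idea):
> >>
> >>     import java.util.Map;
> >>
> >>     import org.apache.hadoop.conf.Configuration;
> >>     import org.apache.nifi.components.PropertyDescriptor;
> >>     import org.apache.nifi.processor.ProcessContext;
> >>
> >>     // Hypothetical helper: forward user-defined dynamic properties from
> >>     // the processor configuration into the underlying Hadoop Configuration.
> >>     public final class DynamicHadoopProperties {
> >>
> >>         private DynamicHadoopProperties() {
> >>         }
> >>
> >>         public static void apply(final ProcessContext context, final Configuration config) {
> >>             for (final Map.Entry<PropertyDescriptor, String> entry
> >>                     : context.getProperties().entrySet()) {
> >>                 final PropertyDescriptor descriptor = entry.getKey();
> >>                 if (descriptor.isDynamic() && entry.getValue() != null) {
> >>                     // e.g. fs.alluxio.impl -> alluxio.hadoop.FileSystem
> >>                     config.set(descriptor.getName(), entry.getValue());
> >>                 }
> >>             }
> >>         }
> >>     }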
> >>
> >> Any opinion would be welcome
> >>
> >> PS - sent it again with the proper subject label
> >>
>
