Just wanted to add one more point which IMHO is just as important... Certain “artifacts” (i.e., NARs that depend on libraries which are not ASF-friendly) may not fit the ASF licensing requirements of a genuine Apache NiFi distribution, yet add great value for the greater community of NiFi users, so having them NOT be part of the official NiFi distribution is a value in itself.
Cheers
Oleg

> On Feb 22, 2017, at 12:52 PM, Oleg Zhurakousky <[email protected]> wrote:
>
> Adam
>
> I 100% agree with your comment on "official/sanctioned”. With an external
> artifact registry such as BinTray, for example, or GitHub, one cannot control
> what is there, only how to get it. The final decision is left to the end
> user.
> Artifacts could be rated, and/or Apache NiFi (and/or commercial distributions
> of NiFi) could “endorse” or “un-endorse” certain artifacts, and IMHO that is
> perfectly fine. On top of that, a future distribution of NiFi could have
> configuration to account for the “endorsed/supported” artifacts, yet it
> should not stop one from downloading and trying something new.
>
> Cheers
> Oleg
>
>> On Feb 22, 2017, at 12:43 PM, Adam Lamar <[email protected]> wrote:
>>
>> Hey all,
>>
>> I can understand Andre's perspective - when I was building the ListS3
>> processor, I mostly just copied the bits that made sense from ListHDFS and
>> ListFile. That worked, but it's a poor way to ensure consistency across
>> List* processors.
>>
>> As a once-in-a-while contributor, I love the idea that community
>> contributions are respected and we're not dropping them, because they solve
>> real needs right now, and it isn't clear another approach would be better.
>>
>> And I disagree slightly with the notion that an artifact registry will
>> solve the problem - I think it could make it worse, at least from a
>> consistency point of view. Taming _is_ important, which is one reason
>> registry communities have official/sanctioned modules. Quality and
>> interoperability can vary vastly.
>>
>> By convention, it seems like NiFi already has a handful of well-understood
>> patterns - List, Fetch, Get, Put, etc. all mean something specific in
>> processor terms. Is there a reason not to formalize those patterns in the
>> code as well? That would help with processor consistency, and if done
>> right, it may even be easier to write new processors, fix bugs, etc.
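[Editor's note] The "formalize the patterns in code" idea can be sketched outside NiFi itself. The toy below is NOT NiFi's actual API - it is a hypothetical, dependency-free illustration of the template-method idea: an abstract base class owns the incremental-listing state and the fixed order of operations (emit results first, then advance the watermark), so a concrete List* processor only supplies the store-specific listing call and cannot get the ordering wrong.

```java
// Hypothetical sketch (not NiFi's real API): a generalized base class for
// List* processors. The base owns the incremental-listing state and the
// order of operations, so concrete processors cannot repeat a
// commit-ordering mistake.
import java.util.ArrayList;
import java.util.List;

abstract class GeneralizedListProcessor<T> {
    // Highest timestamp already emitted; survives across onTrigger() runs.
    private long lastTimestamp = Long.MIN_VALUE;

    /** The only store-specific hook: list entities newer than the watermark. */
    protected abstract List<T> performListing(long minTimestampExclusive);

    /** Store-specific: extract the entity's timestamp. */
    protected abstract long timestampOf(T entity);

    /** Template method: the fixed ordering lives here, not in each subclass. */
    public final List<T> onTrigger() {
        List<T> listed = performListing(lastTimestamp);
        // 1) hand the results off (stand-in for FlowFile transfer + session commit)...
        List<T> emitted = new ArrayList<>(listed);
        // 2) ...and only THEN advance the stored watermark, so a failure
        //    before this point cannot silently skip entities on the next run.
        for (T e : listed) {
            lastTimestamp = Math.max(lastTimestamp, timestampOf(e));
        }
        return emitted;
    }
}

// Toy stand-in for ListS3/ListHDFS: the "timestamps" are just the values.
class ListNumbers extends GeneralizedListProcessor<Long> {
    private final List<Long> backing;

    ListNumbers(List<Long> backing) {
        this.backing = backing;
    }

    @Override
    protected List<Long> performListing(long minExclusive) {
        List<Long> out = new ArrayList<>();
        for (long v : backing) {
            if (v > minExclusive) {
                out.add(v);
            }
        }
        return out;
    }

    @Override
    protected long timestampOf(Long entity) {
        return entity;
    }
}
```

NiFi's real base class (the one Adam says "already exists") works against ProcessSession and managed state rather than plain lists; the sketch only shows why pushing the ordering into a shared base class removes a whole class of commit bugs from every subclass at once.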
>>
>> For example, ListS3 initially shipped with some bad session commit()
>> behavior, which was obvious once identified, but a generalized
>> AbstractListProcessor (higher-level than the one that already exists) could
>> make it easier to avoid this class of bug.
>>
>> Admittedly this could be a lot of work.
>>
>> Cheers,
>> Adam
>>
>> On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <[email protected]> wrote:
>>
>>> I’ll second Pierre.
>>>
>>> Yes, with the current deployment model the number of processors and the
>>> size of the NiFi distribution are a concern, simply because they grow with
>>> each release. But that should not be the driver to start jamming more
>>> functionality into existing processors which on the surface may look
>>> related (even if they are).
>>> Basically, a processor should never be complex with regard to being
>>> understood by a non-technical end user, so “specialization” always takes
>>> precedence here, since it limits “configuration” and thus makes such a
>>> processor simpler. It also helps with maintenance and management of such a
>>> processor by the developer. Also, having multiple related processors will
>>> promote healthy competition, where my MyPutHDFS may in certain cases be
>>> better/faster than YourPutHDFS - and why not have both?
>>>
>>> The “artifact registry” (flow, extension, template, etc.) is the only
>>> answer here, since it removes the “proliferation” and the need for
>>> “taming” anything from the picture. With an “artifact registry”, whether
>>> there is one processor or one million, the NiFi size/state will always
>>> remain constant and small.
>>>
>>> Cheers
>>> Oleg
>>>
>>>> On Feb 22, 2017, at 6:05 AM, Pierre Villard <[email protected]> wrote:
>>>>
>>>> Hey guys,
>>>>
>>>> Thanks for the thread Andre.
>>>>
>>>> +1 to James' answer.
>>>>
>>>> I understand the interest that a single processor to connect to all the
>>>> back ends would provide...
and we could document/improve the PutHDFS to ease
>>>> such use, but I really don't think it would benefit the user experience.
>>>> That may be interesting in some cases for some users, but I don't think
>>>> that would be the majority.
>>>>
>>>> I believe NiFi is great for one reason: you have a lot of specialized
>>>> processors that are really easy to use and efficient for what they've
>>>> been designed for.
>>>>
>>>> Let's ask ourselves the question the other way: with the NiFi Registry on
>>>> its way, what is the problem with having multiple processors for each
>>>> back end? I don't really see the issue here. OK, we have a lot of
>>>> processors (but I believe this is a good point for NiFi, for user
>>>> experience, for advertising, etc. - maybe we should improve the processor
>>>> listing, though again, this will be part of the NiFi Registry work), and
>>>> it generates a heavy NiFi binary (but that will be solved with the
>>>> registry), but that's all, no?
>>>>
>>>> Also agreed on the positioning aspect: IMO NiFi should not be tightly
>>>> tied to the Hadoop ecosystem. There are a lot of users using NiFi with
>>>> absolutely no relation to Hadoop. Not sure that would send the right
>>>> "signal".
>>>>
>>>> Pierre
>>>>
>>>> 2017-02-22 6:50 GMT+01:00 Andre <[email protected]>:
>>>>
>>>>> Andrew,
>>>>>
>>>>> On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande <[email protected]> wrote:
>>>>>
>>>>>> I am observing one assumption in this thread. For some reason we are
>>>>>> implying all these will be Hadoop-compatible file systems. They don't
>>>>>> always have an HDFS plugin, nor should they as a mandatory requirement.
>>>>>
>>>>> You are partially correct.
>>>>>
>>>>> There is a direct assumption in the availability of an HCFS (thanks
>>>>> Matt!) implementation.
>>>>>
>>>>> This is the case with:
>>>>>
>>>>> * Windows Azure Blob Storage
>>>>> * Google Cloud Storage Connector
>>>>> * MapR FileSystem (currently done via NAR recompilation / mvn profile)
>>>>> * Alluxio
>>>>> * Isilon (via HDFS)
>>>>> * others
>>>>>
>>>>> But I wouldn't say this will apply to every other storage system, and in
>>>>> certain cases it may not even be necessary (e.g. Isilon scale-out
>>>>> storage may be reached using its native HDFS-compatible interfaces).
>>>>>
>>>>>> Untie completely from the Hadoop NAR. This allows for effective MiNiFi
>>>>>> interaction without the weight of the Hadoop libs, for example.
>>>>>> Massive size savings where it matters.
>>>>>
>>>>> Are you suggesting a use case where MiNiFi agents interact directly with
>>>>> cloud storage, without relying on NiFi hubs to do that?
>>>>>
>>>>>> For the deployment, it's easy enough for an admin to either rely on a
>>>>>> standard tar or rpm if the NAR modules are already available in the
>>>>>> distro (well, I won't talk registry till it arrives). Mounting a common
>>>>>> directory on every node or distributing additional jars everywhere,
>>>>>> plus configs, and then keeping it all consistent is something which can
>>>>>> be avoided by simpler packaging.
>>>>>
>>>>> As long as the NAR or RPM supports your use case, which is not the case
>>>>> for people running NiFi with MapR-FS, for example. For those, a
>>>>> recompilation is required anyway. A flexible processor may remove the
>>>>> need to recompile (I am currently playing with the classpath
>>>>> implications for MapR users).
>>>>>
>>>>> Cheers
>>>
>>>
>
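[Editor's note] The HCFS implementations listed above all plug into the same Hadoop FileSystem API by registering a class against a URI scheme, which is what lets the existing HDFS processors reach them. A minimal, illustrative core-site.xml sketch follows; the property keys and class names are the ones the connectors documented, but treat them as examples and verify them against the connector versions you actually deploy:

```xml
<!-- Illustrative only: maps URI schemes to HCFS implementations.
     Verify class names against the connector versions you deploy. -->
<configuration>
  <!-- Google Cloud Storage Connector: gs:// -->
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <!-- Azure Blob Storage (hadoop-azure): wasb:// -->
  <property>
    <name>fs.wasb.impl</name>
    <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
  </property>
  <!-- Alluxio client: alluxio:// -->
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
```

In NiFi such a file is referenced from the Hadoop Configuration Resources property of the HDFS processors, with the connector jars made visible on the classpath - exactly the jar-distribution burden Andrew describes above, which is why the packaging question matters.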
