Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Andre
Joe,

Thanks for the comments!

Slightly deviating from code consistency discussion but I think you raised
some important points.

I may be on the glass-half-empty side, but as a user who has witnessed many
times how modularity played out in other communities, I am not particularly
excited about the prospect.

I agree with Olegz's comments about the benefits of the registry for code
that, due to licensing issues, could not be merged into NiFi, and I am excited
about faster builds and selective packaging, but past that point, IMNSHO,
things get ugly very quickly.

To be 100% transparent, it is probably because I personally benefited from
having my code reviewed and subjected to the always helpful input of people
like yourself, Aldrin, BBende, JPercivall, Matt B, Oleg, Pierre, and the
rest of the wider community. I think of how ListenSMTP evolved more in two
days than I could have improved it in a lifetime, all thanks to a very
strong core of contributors around this community.

Perhaps it is because, as a volunteer, I have had to manage some sites built
around WordPress and noticed that quite frequently a WordPress site is a
sea of plugin modules that, as time progresses, start getting outdated,
turning the long-term maintenance of the platform into an absolute pain. And
to be honest, WordPress has a healthy ecosystem around it: free modules,
commercial modules, independent registries, product reviews, and support forums.

At this stage I would like to point out: support forums. We need to start
thinking about how we plan to support third-party plugins, as shared mailing
lists are woefully inadequate for that. Even Elastic seems to have ended up
using both its support forums and GitHub's issue pages to provide support to
users.

In any case, back to the original issue:

I will second Adam's comments about not assuming the registry will help with
the originally highlighted issues: consistency, code repetition, and
long-term support.

Cheers



On Thu, Feb 23, 2017 at 6:25 AM, Joe Witt  wrote:

> more good points...
>
> We will need to think through/document the benefits that this
> extension bazaar would provide and explain the risks it also brings to
> users.  We should document for developers best practices in building
> new components or extending existing ones and we should allow users to
> socialize their findings for processors.  If someone sees a processor
> with 'one star' from a few users versus '4.5' from a thousand they
> would get a different level of confidence.  We should in general think
> about how to 'mature' components in these registries and such.
>
> That said, I think some of the problems we're talking about we'd be
> very honored and fortunate to be solving.  Right now we're making it
> hard to build and develop releases and slowing our agility as a
> community.
>
> Good problems to have though!
>
> joe
>
>


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Joe Witt
more good points...

We will need to think through/document the benefits that this
extension bazaar would provide and explain the risks it also brings to
users.  We should document for developers best practices in building
new components or extending existing ones and we should allow users to
socialize their findings for processors.  If someone sees a processor
with 'one star' from a few users versus '4.5' from a thousand they
would get a different level of confidence.  We should in general think
about how to 'mature' components in these registries and such.

That said, I think some of the problems we're talking about we'd be
very honored and fortunate to be solving.  Right now we're making it
hard to build and develop releases and slowing our agility as a
community.

Good problems to have though!

joe

On Wed, Feb 22, 2017 at 2:20 PM, Adam Lamar  wrote:
> To be clear, I really love the idea of an extension registry and have at
> least one custom processor that would be a great fit for some of the
> reasons listed by Oleg, and its really cool thinking that user data can
> drive NiFi improvements. We're on the same page there.
>
> Let's go back to one of Andre's points as I understand it: Many of the
> processors that do similar things are different in lots of minor but
> unnecessary ways, and this hurts the user experience.
>
> With an extension registry, NiFi users would potentially have access to
> even more processors, but these processors don't undergo the same code
> review as they would being introduced into the mainline tree today, and if
> someone writes ListX or ListY, there seems to be little incentive for them
> to match existing processor behavior, because ListX and ListY exist
> independently from the core processors.
>
> I've already shown my hand, but I'm interested to hear what others think.
> Is this lack of consistency a problem, and if so, how does NiFi mitigate
> the potential issues?
>
> Adam
>
>
> On Wed, Feb 22, 2017 at 11:22 AM, Oleg Zhurakousky <
> ozhurakou...@hortonworks.com> wrote:
>
>> Just wanted to add one more point which IMHO just as important. . .
>> Certain “artifacts” (i.e., NARs that depends on libraries which are not
>> ASF friendly) may not fit the ASF licensing requirements of genuine Apache
>> NiFi distribution, yet add a great value for greater community of NiFi
>> users, so having them NOT being part of official NiFi distribution is a
>> value in itself.
>>
>> Cheers
>> Oleg
>>
>> > On Feb 22, 2017, at 12:52 PM, Oleg Zhurakousky <
>> ozhurakou...@hortonworks.com> wrote:
>> >
>> > Adam
>> >
>> > I 100% agree with your comment on "official/sanctioned”. As an external
>> artifact registry such as BinTray for example or GitHub, one can not
>> control what is there, rather how to get it. The final decision is left to
>> the end user.
>> > Artifacts could be rated and/or Apache NiFi (and/or commercial
>> distributions of NiFi) can “endorse” and/or “un-endorse” certain artifacts
>> and IMHO that is perfectly fine. On top of that a future distribution of
>> NiFi can have configuration to account for the “endorsed/supported”
>> artifacts, yet it should not stop one from downloading and trying something
>> new.
>> >
>> > Cheers
>> > Oleg
>> >
>> >> On Feb 22, 2017, at 12:43 PM, Adam Lamar  wrote:
>> >>
>> >> Hey all,
>> >>
>> >> I can understand Andre's perspective - when I was building the ListS3
>> >> processor, I mostly just copied the bits that made sense from ListHDFS
>> and
>> >> ListFile. That worked, but its a poor way to ensure consistency across
>> >> List* processors.
>> >>
>> >> As a once-in-a-while contributor, I love the idea that community
>> >> contributions are respected and we're not dropping them, because they
>> solve
>> >> real needs right now, and it isn't clear another approach would be
>> better.
>> >>
>> >> And I disagree slightly with the notion that an artifact registry will
>> >> solve the problem - I think it could make it worse, at least from a
>> >> consistency point of view. Taming _is_ important, which is one reason
>> >> registry communities have official/sanctioned modules. Quality and
>> >> interoperability can vary vastly.
>> >>
>> >> By convention, it seems like NiFi already has a handful of
>> well-understood
>> >> patterns - List, Fetch, Get, Put, etc all mean something specific in
>> >> processor terms. Is there a reason not to formalize those patterns in
>> the
>> >> code as well? That would help with processor consistency, and if done
>> >> right, it may even be easier to write new processors, fix bugs, etc.
>> >>
>> >> For example, ListS3 initially shipped with some bad session commit()
>> >> behavior, which was obvious once identified, but a generalized
>> >> AbstractListProcessor (higher level that the one that already exists)
>> could
>> >> make it easier to avoid this class of bug.
>> >>
>> >> Admittedly this could be a lot of work.
>> >>
>> >> Cheers,
>> >> Adam
>> >>
>> >>
>> >>
>> 

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Adam Lamar
To be clear, I really love the idea of an extension registry and have at
least one custom processor that would be a great fit for some of the
reasons listed by Oleg, and it's really cool to think that user data can
drive NiFi improvements. We're on the same page there.

Let's go back to one of Andre's points as I understand it: Many of the
processors that do similar things are different in lots of minor but
unnecessary ways, and this hurts the user experience.

With an extension registry, NiFi users would potentially have access to
even more processors, but these processors don't undergo the same code
review as they would being introduced into the mainline tree today, and if
someone writes ListX or ListY, there seems to be little incentive for them
to match existing processor behavior, because ListX and ListY exist
independently from the core processors.

I've already shown my hand, but I'm interested to hear what others think.
Is this lack of consistency a problem, and if so, how does NiFi mitigate
the potential issues?

Adam


On Wed, Feb 22, 2017 at 11:22 AM, Oleg Zhurakousky <
ozhurakou...@hortonworks.com> wrote:

> Just wanted to add one more point which IMHO just as important. . .
> Certain “artifacts” (i.e., NARs that depends on libraries which are not
> ASF friendly) may not fit the ASF licensing requirements of genuine Apache
> NiFi distribution, yet add a great value for greater community of NiFi
> users, so having them NOT being part of official NiFi distribution is a
> value in itself.
>
> Cheers
> Oleg
>
> > On Feb 22, 2017, at 12:52 PM, Oleg Zhurakousky <
> ozhurakou...@hortonworks.com> wrote:
> >
> > Adam
> >
> > I 100% agree with your comment on "official/sanctioned”. As an external
> artifact registry such as BinTray for example or GitHub, one can not
> control what is there, rather how to get it. The final decision is left to
> the end user.
> > Artifacts could be rated and/or Apache NiFi (and/or commercial
> distributions of NiFi) can “endorse” and/or “un-endorse” certain artifacts
> and IMHO that is perfectly fine. On top of that a future distribution of
> NiFi can have configuration to account for the “endorsed/supported”
> artifacts, yet it should not stop one from downloading and trying something
> new.
> >
> > Cheers
> > Oleg
> >
> >> On Feb 22, 2017, at 12:43 PM, Adam Lamar  wrote:
> >>
> >> Hey all,
> >>
> >> I can understand Andre's perspective - when I was building the ListS3
> >> processor, I mostly just copied the bits that made sense from ListHDFS
> and
> >> ListFile. That worked, but its a poor way to ensure consistency across
> >> List* processors.
> >>
> >> As a once-in-a-while contributor, I love the idea that community
> >> contributions are respected and we're not dropping them, because they
> solve
> >> real needs right now, and it isn't clear another approach would be
> better.
> >>
> >> And I disagree slightly with the notion that an artifact registry will
> >> solve the problem - I think it could make it worse, at least from a
> >> consistency point of view. Taming _is_ important, which is one reason
> >> registry communities have official/sanctioned modules. Quality and
> >> interoperability can vary vastly.
> >>
> >> By convention, it seems like NiFi already has a handful of
> well-understood
> >> patterns - List, Fetch, Get, Put, etc all mean something specific in
> >> processor terms. Is there a reason not to formalize those patterns in
> the
> >> code as well? That would help with processor consistency, and if done
> >> right, it may even be easier to write new processors, fix bugs, etc.
> >>
> >> For example, ListS3 initially shipped with some bad session commit()
> >> behavior, which was obvious once identified, but a generalized
> >> AbstractListProcessor (higher level that the one that already exists)
> could
> >> make it easier to avoid this class of bug.
> >>
> >> Admittedly this could be a lot of work.
> >>
> >> Cheers,
> >> Adam
> >>
> >>
> >>
> >> On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
> >> ozhurakou...@hortonworks.com> wrote:
> >>
> >>> I’ll second Pierre
> >>>
> >>> Yes with the current deployment model the amount of processors and the
> >>> size of NiFi distribution is a concern simply because it’s growing with
> >>> each release. But it should not be the driver to start jamming more
> >>> functionality into existing processors which on the surface may look
> like
> >>> related (even if they are).
> >>> Basically a processor should never be complex with regard to it being
> >>> understood by the end user who is non-technical, so “specialization” is
> >>> always takes precedence here since it limits “configuration” and thus
> >>> making such processor simpler. It also helps with maintenance and
> >>> management of such processor by the developer. Also, having multiple
> >>> related processors will promote healthy competition where my MyputHDFS
> may
> >>> for certain cases be better/faster then YourPutHDFS and why 

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Oleg Zhurakousky
Just wanted to add one more point which IMHO is just as important...
Certain “artifacts” (i.e., NARs that depend on libraries which are not ASF 
friendly) may not fit the ASF licensing requirements of a genuine Apache NiFi 
distribution, yet add great value for the greater community of NiFi users, so 
having them NOT be part of the official NiFi distribution is a value in itself.

Cheers
Oleg

> On Feb 22, 2017, at 12:52 PM, Oleg Zhurakousky  
> wrote:
> 
> Adam
> 
> I 100% agree with your comment on "official/sanctioned”. As an external 
> artifact registry such as BinTray for example or GitHub, one can not control 
> what is there, rather how to get it. The final decision is left to the end 
> user.
> Artifacts could be rated and/or Apache NiFi (and/or commercial distributions 
> of NiFi) can “endorse” and/or “un-endorse” certain artifacts and IMHO that is 
> perfectly fine. On top of that a future distribution of NiFi can have 
> configuration to account for the “endorsed/supported” artifacts, yet it 
> should not stop one from downloading and trying something new.
> 
> Cheers
> Oleg
> 
>> On Feb 22, 2017, at 12:43 PM, Adam Lamar  wrote:
>> 
>> Hey all,
>> 
>> I can understand Andre's perspective - when I was building the ListS3
>> processor, I mostly just copied the bits that made sense from ListHDFS and
>> ListFile. That worked, but its a poor way to ensure consistency across
>> List* processors.
>> 
>> As a once-in-a-while contributor, I love the idea that community
>> contributions are respected and we're not dropping them, because they solve
>> real needs right now, and it isn't clear another approach would be better.
>> 
>> And I disagree slightly with the notion that an artifact registry will
>> solve the problem - I think it could make it worse, at least from a
>> consistency point of view. Taming _is_ important, which is one reason
>> registry communities have official/sanctioned modules. Quality and
>> interoperability can vary vastly.
>> 
>> By convention, it seems like NiFi already has a handful of well-understood
>> patterns - List, Fetch, Get, Put, etc all mean something specific in
>> processor terms. Is there a reason not to formalize those patterns in the
>> code as well? That would help with processor consistency, and if done
>> right, it may even be easier to write new processors, fix bugs, etc.
>> 
>> For example, ListS3 initially shipped with some bad session commit()
>> behavior, which was obvious once identified, but a generalized
>> AbstractListProcessor (higher level that the one that already exists) could
>> make it easier to avoid this class of bug.
>> 
>> Admittedly this could be a lot of work.
>> 
>> Cheers,
>> Adam
>> 
>> 
>> 
>> On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
>> ozhurakou...@hortonworks.com> wrote:
>> 
>>> I’ll second Pierre
>>> 
>>> Yes with the current deployment model the amount of processors and the
>>> size of NiFi distribution is a concern simply because it’s growing with
>>> each release. But it should not be the driver to start jamming more
>>> functionality into existing processors which on the surface may look like
>>> related (even if they are).
>>> Basically a processor should never be complex with regard to it being
>>> understood by the end user who is non-technical, so “specialization” is
>>> always takes precedence here since it limits “configuration” and thus
>>> making such processor simpler. It also helps with maintenance and
>>> management of such processor by the developer. Also, having multiple
>>> related processors will promote healthy competition where my MyputHDFS may
>>> for certain cases be better/faster then YourPutHDFS and why not have both?
>>> 
>>> The “artifact registry” (flow, extension, template etc) is the only answer
>>> here since it will remove the “proliferation” and the need for “taming”
>>> anything from the picture. With “artifact registry” one or one million
>>> processors, the NiFi size/state will always remain constant and small.
>>> 
>>> Cheers
>>> Oleg
 On Feb 22, 2017, at 6:05 AM, Pierre Villard 
>>> wrote:
 
 Hey guys,
 
 Thanks for the thread Andre.
 
 +1 to James' answer.
 
 I understand the interest that would provide a single processor to
>>> connect
 to all the back ends... and we could document/improve the PutHDFS to ease
 such use but I really don't think that it will benefit the user
>>> experience.
 That may be interesting in some cases for some users but I don't think
>>> that
 would be a majority.
 
 I believe NiFi is great for one reason: you have a lot of specialized
 processors that are really easy to use and efficient for what they've
>>> been
 designed for.
 
 Let's ask ourselves the question the other way: with the NiFi registry on
 its way, what is the problem having multiple processors for each back
>>> end?
 I don't really 

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Oleg Zhurakousky
Adam

I 100% agree with your comment on "official/sanctioned”. With an external 
artifact registry such as Bintray or GitHub, one cannot control what is there, 
only how to get it. The final decision is left to the end user.
Artifacts could be rated, and/or Apache NiFi (and/or commercial distributions of 
NiFi) could “endorse” and/or “un-endorse” certain artifacts, and IMHO that is 
perfectly fine. On top of that, a future distribution of NiFi could have 
configuration to account for the “endorsed/supported” artifacts, yet it should 
not stop one from downloading and trying something new.

Cheers
Oleg

> On Feb 22, 2017, at 12:43 PM, Adam Lamar  wrote:
> 
> Hey all,
> 
> I can understand Andre's perspective - when I was building the ListS3
> processor, I mostly just copied the bits that made sense from ListHDFS and
> ListFile. That worked, but its a poor way to ensure consistency across
> List* processors.
> 
> As a once-in-a-while contributor, I love the idea that community
> contributions are respected and we're not dropping them, because they solve
> real needs right now, and it isn't clear another approach would be better.
> 
> And I disagree slightly with the notion that an artifact registry will
> solve the problem - I think it could make it worse, at least from a
> consistency point of view. Taming _is_ important, which is one reason
> registry communities have official/sanctioned modules. Quality and
> interoperability can vary vastly.
> 
> By convention, it seems like NiFi already has a handful of well-understood
> patterns - List, Fetch, Get, Put, etc all mean something specific in
> processor terms. Is there a reason not to formalize those patterns in the
> code as well? That would help with processor consistency, and if done
> right, it may even be easier to write new processors, fix bugs, etc.
> 
> For example, ListS3 initially shipped with some bad session commit()
> behavior, which was obvious once identified, but a generalized
> AbstractListProcessor (higher level that the one that already exists) could
> make it easier to avoid this class of bug.
> 
> Admittedly this could be a lot of work.
> 
> Cheers,
> Adam
> 
> 
> 
> On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
> ozhurakou...@hortonworks.com> wrote:
> 
>> I’ll second Pierre
>> 
>> Yes with the current deployment model the amount of processors and the
>> size of NiFi distribution is a concern simply because it’s growing with
>> each release. But it should not be the driver to start jamming more
>> functionality into existing processors which on the surface may look like
>> related (even if they are).
>> Basically a processor should never be complex with regard to it being
>> understood by the end user who is non-technical, so “specialization” is
>> always takes precedence here since it limits “configuration” and thus
>> making such processor simpler. It also helps with maintenance and
>> management of such processor by the developer. Also, having multiple
>> related processors will promote healthy competition where my MyputHDFS may
>> for certain cases be better/faster then YourPutHDFS and why not have both?
>> 
>> The “artifact registry” (flow, extension, template etc) is the only answer
>> here since it will remove the “proliferation” and the need for “taming”
>> anything from the picture. With “artifact registry” one or one million
>> processors, the NiFi size/state will always remain constant and small.
>> 
>> Cheers
>> Oleg
>>> On Feb 22, 2017, at 6:05 AM, Pierre Villard 
>> wrote:
>>> 
>>> Hey guys,
>>> 
>>> Thanks for the thread Andre.
>>> 
>>> +1 to James' answer.
>>> 
>>> I understand the interest that would provide a single processor to
>> connect
>>> to all the back ends... and we could document/improve the PutHDFS to ease
>>> such use but I really don't think that it will benefit the user
>> experience.
>>> That may be interesting in some cases for some users but I don't think
>> that
>>> would be a majority.
>>> 
>>> I believe NiFi is great for one reason: you have a lot of specialized
>>> processors that are really easy to use and efficient for what they've
>> been
>>> designed for.
>>> 
>>> Let's ask ourselves the question the other way: with the NiFi registry on
>>> its way, what is the problem having multiple processors for each back
>> end?
>>> I don't really see the issue here. OK we have a lot of processors (but I
>>> believe this is a good point for NiFi, for user experience, for
>>> advertising, etc. - maybe we should improve the processor listing though,
>>> but again, this will be part of the NiFi Registry work), it generates a
>>> heavy NiFi binary (but that will be solved with the registry), but that's
>>> all, no?
>>> 
>>> Also agree on the positioning aspect: IMO NiFi should not be highly tied
>> to
>>> the Hadoop ecosystem. There is a lot of users using NiFi with absolutely
>> no
>>> relation to Hadoop. Not sure that would send the good 

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Joe Witt
Adam,

Some great points there.  I think what would be good here to keep in
mind is 'who' will tame these things.

For various patterns that are chosen and abstractions found and code written:
  - The developers do the taming.

For the extension registry and which processors become popular or
become unused and phase out:
 - The users/flow managers do the taming.

It is certainly the case we need to think through a robust plan which
allows both developers and users to provide the feedback and energy
necessary.  To date, we've not allowed the users to have much direct
influence here and we really don't have a strong sense of which
components are most commonly used.  One of the things I am most
excited by with the extension registry and related efforts is that it
will help us make more data driven decisions about where to focus our
energies.

Thanks
Joe

On Wed, Feb 22, 2017 at 12:43 PM, Adam Lamar  wrote:
> Hey all,
>
> I can understand Andre's perspective - when I was building the ListS3
> processor, I mostly just copied the bits that made sense from ListHDFS and
> ListFile. That worked, but its a poor way to ensure consistency across
> List* processors.
>
> As a once-in-a-while contributor, I love the idea that community
> contributions are respected and we're not dropping them, because they solve
> real needs right now, and it isn't clear another approach would be better.
>
> And I disagree slightly with the notion that an artifact registry will
> solve the problem - I think it could make it worse, at least from a
> consistency point of view. Taming _is_ important, which is one reason
> registry communities have official/sanctioned modules. Quality and
> interoperability can vary vastly.
>
> By convention, it seems like NiFi already has a handful of well-understood
> patterns - List, Fetch, Get, Put, etc all mean something specific in
> processor terms. Is there a reason not to formalize those patterns in the
> code as well? That would help with processor consistency, and if done
> right, it may even be easier to write new processors, fix bugs, etc.
>
> For example, ListS3 initially shipped with some bad session commit()
> behavior, which was obvious once identified, but a generalized
> AbstractListProcessor (higher level that the one that already exists) could
> make it easier to avoid this class of bug.
>
> Admittedly this could be a lot of work.
>
> Cheers,
> Adam
>
>
>
> On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
> ozhurakou...@hortonworks.com> wrote:
>
>> I’ll second Pierre
>>
>> Yes with the current deployment model the amount of processors and the
>> size of NiFi distribution is a concern simply because it’s growing with
>> each release. But it should not be the driver to start jamming more
>> functionality into existing processors which on the surface may look like
>> related (even if they are).
>> Basically a processor should never be complex with regard to it being
>> understood by the end user who is non-technical, so “specialization” is
>> always takes precedence here since it limits “configuration” and thus
>> making such processor simpler. It also helps with maintenance and
>> management of such processor by the developer. Also, having multiple
>> related processors will promote healthy competition where my MyputHDFS may
>> for certain cases be better/faster then YourPutHDFS and why not have both?
>>
>> The “artifact registry” (flow, extension, template etc) is the only answer
>> here since it will remove the “proliferation” and the need for “taming”
>> anything from the picture. With “artifact registry” one or one million
>> processors, the NiFi size/state will always remain constant and small.
>>
>> Cheers
>> Oleg
>> > On Feb 22, 2017, at 6:05 AM, Pierre Villard 
>> wrote:
>> >
>> > Hey guys,
>> >
>> > Thanks for the thread Andre.
>> >
>> > +1 to James' answer.
>> >
>> > I understand the interest that would provide a single processor to
>> connect
>> > to all the back ends... and we could document/improve the PutHDFS to ease
>> > such use but I really don't think that it will benefit the user
>> experience.
>> > That may be interesting in some cases for some users but I don't think
>> that
>> > would be a majority.
>> >
>> > I believe NiFi is great for one reason: you have a lot of specialized
>> > processors that are really easy to use and efficient for what they've
>> been
>> > designed for.
>> >
>> > Let's ask ourselves the question the other way: with the NiFi registry on
>> > its way, what is the problem having multiple processors for each back
>> end?
>> > I don't really see the issue here. OK we have a lot of processors (but I
>> > believe this is a good point for NiFi, for user experience, for
>> > advertising, etc. - maybe we should improve the processor listing though,
>> > but again, this will be part of the NiFi Registry work), it generates a
>> > heavy NiFi binary (but that will be solved with the registry), but 

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Adam Lamar
Hey all,

I can understand Andre's perspective - when I was building the ListS3
processor, I mostly just copied the bits that made sense from ListHDFS and
ListFile. That worked, but it's a poor way to ensure consistency across
List* processors.

As a once-in-a-while contributor, I love the idea that community
contributions are respected and we're not dropping them, because they solve
real needs right now, and it isn't clear another approach would be better.

And I disagree slightly with the notion that an artifact registry will
solve the problem - I think it could make it worse, at least from a
consistency point of view. Taming _is_ important, which is one reason
registry communities have official/sanctioned modules. Quality and
interoperability can vary vastly.

By convention, it seems like NiFi already has a handful of well-understood
patterns - List, Fetch, Get, Put, etc all mean something specific in
processor terms. Is there a reason not to formalize those patterns in the
code as well? That would help with processor consistency, and if done
right, it may even be easier to write new processors, fix bugs, etc.

For example, ListS3 initially shipped with some bad session commit()
behavior, which was obvious once identified, but a generalized
AbstractListProcessor (higher level than the one that already exists) could
make it easier to avoid this class of bug.
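
For illustration, the template-method shape described above could look roughly
like the sketch below. All names here (GeneralizedListProcessor, performListing,
and the simplified onTrigger signature) are hypothetical, not the actual NiFi
API; a real processor would work against ProcessContext/ProcessSession. It only
shows where the shared state-tracking and commit ordering would live:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a template-method base class that owns the
// timestamp tracking and emit-then-advance-state ordering shared by all
// List* processors, so each backend only supplies the listing itself.
abstract class GeneralizedListProcessor<T> {

    private long lastSeenTimestamp = 0L;

    // Backend-specific: return all entities newer than the given timestamp.
    protected abstract List<T> performListing(long minTimestampExclusive);

    // Backend-specific: extract a modification timestamp from an entity.
    protected abstract long getTimestamp(T entity);

    // Shared skeleton: list, emit, and only then advance the tracked state.
    // The session-commit ordering bug class lives (and is fixed) here once,
    // instead of being re-implemented in every List* processor.
    public final List<T> onTrigger() {
        List<T> listed = performListing(lastSeenTimestamp);
        List<T> emitted = new ArrayList<>(listed);
        // ... in a real processor: create FlowFiles, session.commit() ...
        for (T entity : listed) {
            lastSeenTimestamp = Math.max(lastSeenTimestamp, getTimestamp(entity));
        }
        return emitted;
    }
}

// Minimal in-memory backend showing what a concrete subclass supplies.
class InMemoryLister extends GeneralizedListProcessor<Long> {
    private final List<Long> entities = List.of(10L, 20L, 30L);

    @Override
    protected List<Long> performListing(long minTimestampExclusive) {
        List<Long> result = new ArrayList<>();
        for (Long ts : entities) {
            if (ts > minTimestampExclusive) {
                result.add(ts);
            }
        }
        return result;
    }

    @Override
    protected long getTimestamp(Long entity) {
        return entity;
    }
}

public class ListPatternSketch {
    public static void main(String[] args) {
        InMemoryLister lister = new InMemoryLister();
        System.out.println(lister.onTrigger()); // prints [10, 20, 30]
        System.out.println(lister.onTrigger()); // prints [] (nothing new)
    }
}
```

The point is that a backend like S3 or HDFS would only fill in
performListing/getTimestamp, and the tricky ordering stays in one place.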

Admittedly this could be a lot of work.

Cheers,
Adam



On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
ozhurakou...@hortonworks.com> wrote:

> I’ll second Pierre
>
> Yes with the current deployment model the amount of processors and the
> size of NiFi distribution is a concern simply because it’s growing with

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Oleg Zhurakousky
I’ll second Pierre

Yes, with the current deployment model the number of processors and the size of
the NiFi distribution are a concern, simply because they grow with each release.
But that should not be the driver to start jamming more functionality into
existing processors which on the surface may look related (even if they are).
Basically, a processor should never be complex from the perspective of a
non-technical end user, so “specialization” always takes precedence here, since
it limits “configuration” and thus makes such a processor simpler. It also helps
the developer with maintenance and management of the processor. Also, having
multiple related processors will promote healthy competition: my MyPutHDFS may
for certain cases be better/faster than YourPutHDFS, and why not have both?

The “artifact registry” (flow, extension, template, etc.) is the only answer
here, since it removes the “proliferation” and the need for “taming” anything
from the picture. With an “artifact registry”, whether there are one or one
million processors, the NiFi size/state will always remain constant and small.

Cheers
Oleg
> On Feb 22, 2017, at 6:05 AM, Pierre Villard  
> wrote:
> 
> Hey guys,
> 
> Thanks for the thread Andre.
> 
> +1 to James' answer.
> 
> I understand the interest of a single processor that would connect to all
> the back ends... and we could document/improve PutHDFS to ease such use,
> but I really don't think it would benefit the user experience.
> That may be interesting in some cases for some users but I don't think that
> would be a majority.
> 
> I believe NiFi is great for one reason: you have a lot of specialized
> processors that are really easy to use and efficient for what they've been
> designed for.
> 
> Let's ask ourselves the question the other way: with the NiFi registry on
> its way, what is the problem having multiple processors for each back end?
> I don't really see the issue here. OK we have a lot of processors (but I
> believe this is a good point for NiFi, for user experience, for
> advertising, etc. - maybe we should improve the processor listing though,
> but again, this will be part of the NiFi Registry work), it generates a
> heavy NiFi binary (but that will be solved with the registry), but that's
> all, no?
> 
> Also agree on the positioning aspect: IMO NiFi should not be highly tied
> to the Hadoop ecosystem. There are a lot of users using NiFi with
> absolutely no relation to Hadoop. Not sure that would send the right
> "signal".
> 
> Pierre
> 
> 
> 
> 
> 2017-02-22 6:50 GMT+01:00 Andre :
> 
>> Andrew,
>> 
>> 
>> On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande 
>> wrote:
>> 
>>> I am observing one assumption in this thread. For some reason we are
>>> implying all these will be Hadoop-compatible file systems. They don't
>>> always have an HDFS plugin, nor should they as a mandatory requirement.
>>> 
>> 
>> You are partially correct.
>> 
>> There is a direct assumption in the availability of a HCFS (thanks Matt!)
>> implementation.
>> 
>> This is the case with:
>> 
>> * Windows Azure Blob Storage
>> * Google Cloud Storage Connector
>> * MapR FileSystem (currently done via NAR recompilation / mvn profile)
>> * Alluxio
>> * Isilon (via HDFS)
>> * others
>> 
>> But I wouldn't say this applies to every other storage system, and in
>> certain cases it may not even be necessary (e.g. Isilon scale-out storage
>> may be reached using its native HDFS-compatible interfaces).
>> 
>> 
>>> Untie completely from the Hadoop NAR. This allows for effective MiNiFi
>>> interaction without the weight of the Hadoop libs, for example. Massive
>>> size savings where it matters.
>>> 
>>> 
>> Are you suggesting a use case where MiNiFi agents interact directly with
>> cloud storage, without relying on NiFi hubs to do that?
>> 
>> 
>>> For the deployment, it's easy enough for an admin to either rely on a
>>> standard tar or rpm if the NAR modules are already available in the
>> distro
>>> (well, I won't talk registry till it arrives). Mounting a common
>> directory
>>> on every node or distributing additional jars everywhere, plus configs,
>> and
>>> then keeping it consistent across nodes is something which can be avoided by
>>> simpler packaging.
>>> 
>> 
>> As long as the NAR or RPM supports your use case, which is not the case for
>> people running NiFi with MapR-FS, for example. For those, a recompilation is
>> required anyway. A flexible processor may remove the need to recompile (I
>> am currently playing with the classpath implications for MapR users).
>> 
>> Cheers
>> 



Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-22 Thread Bryan Bende
I tend to agree with a lot of the points made by James and Pierre...

Given that the end user of NiFi is not always a developer, it seems
more user-friendly to have the specific processors and not have users
trying to come up with the right set of JARs and the right
configuration properties (although many power users can do this).

Since the processors we are talking about already exist, and many came
from great community contributions, I don't think we should get rid of
any of them.  If there are inconsistencies that can be improved, such
as some processors using EL and others not, then we should definitely
make those improvements.





On Wed, Feb 22, 2017 at 8:42 AM, Andre  wrote:
> Pierre,
>
>
>> I believe NiFi is great for one reason: you have a lot of specialized
>> processors that are really easy to use and efficient for what they've been
>> designed for.
>>
>
>> Let's ask ourselves the question the other way: with the NiFi registry on
>> its way, what is the problem having multiple processors for each back end?
>> I don't really see the issue here. OK we have a lot of processors (but I
>> believe this is a good point for NiFi, for user experience, for
>> advertising, etc. - maybe we should improve the processor listing though,
>> but again, this will be part of the NiFi Registry work), it generates a
>> heavy NiFi binary (but that will be solved with the registry), but that's
>> all, no?
>>
>
> The natural trade-off being fragmentation, code support and consistency?
>
> Simple example?
>
> ListS3 = uses @InputRequirement(Requirement.INPUT_FORBIDDEN)
> ListGCSBucket = INPUT_FORBIDDEN seems to be absent; however, expression
> language is disabled on most properties, suggesting the design did not
> intend it to have input. Simple bug (NIFI-3514), simple fix (PR#1526).
>
> Yes, no doubt, ListS3 presents S3's properties in a clear fashion. Certainly
> ListGCSBucket represents GCS metadata as attributes in a more specific way,
> and this is handy, but that wouldn't be an unmanageable challenge.
>
> This is not an isolated issue; there are plenty of examples, some as simple
> as naming... After all, one could be ultra-pedantic for a second and note
> that ListGCSBucket does not follow the same convention as ListS3 (*).
>
>
> Therefore, while the examples above are overly trivial, they still
> serve as a clear reminder of a very WET vs DRY dilemma. I strongly believe
> we should strive to stay in DRY land.
>
>
> Note however, that I am 100% OK with the idea that using HCFS may be overly
> complex and possibly undesirable;
>
> Nonetheless I think we should at least consider Matt's suggestion of using
> some refactoring magic, or anything that can help us achieve programmatic
> ways of promoting consistency across the common features of those
> processors (with the registry or not).
>
>
>
> I will take the community guidance on this.
>
> Cheers
>
> Andre
>
>
> (*) The closest conventional name would probably be ListGCS, as no other
> List processor seems to name the unit of collection (i.e. it is ListSFTP,
> not ListSFTPFolder). I have not raised a JIRA ticket, but I suggest the
> name be changed for better user experience.


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-21 Thread Andrew Grande
I am observing one assumption in this thread. For some reason we are
implying all these will be Hadoop-compatible file systems. They don't
always have an HDFS plugin, nor should they as a mandatory requirement.

Untie completely from the Hadoop NAR. This allows for effective MiNiFi
interaction without the weight of the Hadoop libs, for example. Massive
size savings where it matters.

For the deployment, it's easy enough for an admin to either rely on a
standard tar or rpm if the NAR modules are already available in the distro
(well, I won't talk registry till it arrives). Mounting a common directory
on every node, or distributing additional jars everywhere, plus configs, and
then keeping it all consistent across nodes, is something which can be
avoided by simpler packaging.

Andrew

On Tue, Feb 21, 2017, 6:47 PM Andre  wrote:

> Andrew,
>
> Thank you for contributing.
>
> On 22 Feb 2017 10:21 AM, "Andrew Grande"  wrote:
>
> Andre,
>
> I came across multiple NiFi use cases where going through the HDFS layer
> and the fs plugin may not be possible. I.e. when no HDFS layer present at
> all, so no NN to connect to.
>
>
> Not sure I understand what you mean.
>
>
> Another important aspect is operations. Current PutHDFS model with
> additional jar location, well, it kinda works, but I very much dislike it.
> Too many possibilities for a human error in addition to deployment pain,
> especially in a cluster.
>
>
> Fair enough. Would you mind expanding a bit on what sort of  challenges
> currently apply in terms of cluster deployment?
>
>
> Finally, native object storage processors have features which may not even
> apply to the HDFS layer. E.g. the Azure storage has Table storage, etc.
>
>
> This is a very valid point, but I am sure there will be exceptions (in this
> case a NoSQL DB operating under the umbrella term of "storage").
>
> I perhaps should have made it more explicit but the requirements are:
>
> - existence of a Hadoop-compatible interface
> - ability to handle files
>
> Again, thank you for the input, truly appreciated.
>
> Andre
>
> I agree consolidating various efforts is worthwhile, but only within a
> context of a specific storage solution. Not 'unifying' them into a single
> layer.
>
> Andrew
>
> On Tue, Feb 21, 2017, 6:10 PM Andre  wrote:
>
> > dev,
> >
> > I was having a chat with Pierre around PR#379 and we thought it would be
> > worth sharing this with the wider group:
> >
> >
> > I recently noticed that we merged a number of PRs around
> > scale-out/cloud-based object storage into master.
> >
> > Would it make sense to start considering adopting a pattern where
> > Put/Get/ListHDFS are used in tandem with implementations of the
> > hadoop.filesystem interfaces instead of creating new processors, except
> > where a particular deficiency/incompatibility in the hadoop.filesystem
> > implementation exists?
> >
> > Candidates for removal / non merge would be:
> >
> > - Alluxio (PR#379)
> > - WASB (PR#626)
> >  - Azure* (PR#399)
> > - *GCP (recently merged as PR#1482)
> > - *S3 (although this has been in code so it would have to be deprecated)
> >
> > The pattern would be pretty much the same as the one documented and
> > successfully deployed here:
> >
> > https://community.hortonworks.com/articles/71916/connecting-
> > to-azure-data-lake-from-a-nifi-dataflow.html
> >
> > Which means that in the case of Alluxio, one would use the properties
> > documented here:
> >
> > https://www.alluxio.com/docs/community/1.3/en/Running-
> > Hadoop-MapReduce-on-Alluxio.html
> >
> > While with Google Cloud Storage we would use the properties documented
> > here:
> >
> > https://cloud.google.com/hadoop/google-cloud-storage-connector
> >
> > I noticed that specific processors could have the ability to handle
> > properties particular to a filesystem; however, I would like to believe
> > the same issue would plague Hadoop users, and therefore it is reasonable
> > to believe the Hadoop-compatible implementations would have ways of
> > exposing those properties as well.
> >
> > In the case the properties are exposed, we could perhaps simply adjust
> > the *HDFS processors to use dynamic properties to pass them to the
> > underlying module, thereby providing a way to expose particular settings
> > of the underlying storage platform.
> >
> > Any opinion would be welcome
> >
> > PS-sent it again with proper subject label
> >
>


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-21 Thread Andre
Andrew,

Thank you for contributing.

On 22 Feb 2017 10:21 AM, "Andrew Grande"  wrote:

Andre,

I came across multiple NiFi use cases where going through the HDFS layer
and the fs plugin may not be possible. I.e. when no HDFS layer present at
all, so no NN to connect to.


Not sure I understand what you mean.


Another important aspect is operations. Current PutHDFS model with
additional jar location, well, it kinda works, but I very much dislike it.
Too many possibilities for a human error in addition to deployment pain,
especially in a cluster.


Fair enough. Would you mind expanding a bit on what sort of challenges
currently apply in terms of cluster deployment?


Finally, native object storage processors have features which may not even
apply to the HDFS layer. E.g. the Azure storage has Table storage, etc.


This is a very valid point, but I am sure there will be exceptions (in this
case a NoSQL DB operating under the umbrella term of "storage").

I perhaps should have made it more explicit but the requirements are:

- existence of a Hadoop-compatible interface
- ability to handle files

Again, thank you for the input, truly appreciated.

Andre

I agree consolidating various efforts is worthwhile, but only within a
context of a specific storage solution. Not 'unifying' them into a single
layer.

Andrew

On Tue, Feb 21, 2017, 6:10 PM Andre  wrote:

> dev,
>
> I was having a chat with Pierre around PR#379 and we thought it would be
> worth sharing this with the wider group:
>
>
> I recently noticed that we merged a number of PRs around
> scale-out/cloud-based object storage into master.
>
> Would it make sense to start considering adopting a pattern where
> Put/Get/ListHDFS are used in tandem with implementations of the
> hadoop.filesystem interfaces instead of creating new processors, except
> where a particular deficiency/incompatibility in the hadoop.filesystem
> implementation exists?
>
> Candidates for removal / non merge would be:
>
> - Alluxio (PR#379)
> - WASB (PR#626)
>  - Azure* (PR#399)
> - *GCP (recently merged as PR#1482)
> - *S3 (although this has been in code so it would have to be deprecated)
>
> The pattern would be pretty much the same as the one documented and
> successfully deployed here:
>
> https://community.hortonworks.com/articles/71916/connecting-
> to-azure-data-lake-from-a-nifi-dataflow.html
>
> Which means that in the case of Alluxio, one would use the properties
> documented here:
>
> https://www.alluxio.com/docs/community/1.3/en/Running-
> Hadoop-MapReduce-on-Alluxio.html
>
> While with Google Cloud Storage we would use the properties documented
> here:
>
> https://cloud.google.com/hadoop/google-cloud-storage-connector
>
> I noticed that specific processors could have the ability to handle
> properties particular to a filesystem; however, I would like to believe
> the same issue would plague Hadoop users, and therefore it is reasonable
> to believe the Hadoop-compatible implementations would have ways of
> exposing those properties as well.
>
> In the case the properties are exposed, we could perhaps simply adjust
> the *HDFS processors to use dynamic properties to pass them to the
> underlying module, thereby providing a way to expose particular settings
> of the underlying storage platform.
>
> Any opinion would be welcome
>
> PS-sent it again with proper subject label
>


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-21 Thread Matt Burgess
I agree with Andrew in the operations sense, and would like to add
that the user experience around dynamic properties (and even
"conditional" properties that are not dynamic but can be exposed when
other properties are "Applied") can be less-than-ideal and IMHO should
be used sparingly. Full disclosure: My latest processor uses
"conditional" properties at the moment, choosing them over dynamic
properties in the hopes that the user experience is better, but
without in-place updates (possibly implemented under [1]) and/or the
UI making it obvious that dynamic properties are supported (under
[2]), I'm not sure which is better (or if I should create different
processors for my case as well).

Under the hood, if it makes sense to group these processors and
abstract away common code, then I'm all for it.  Especially if we can
use something like the nifi-hadoop-libraries-nar as an ancestor NAR to
provide a common set of libraries to all the Hadoop-Compatible File
System (HCFS) implementations. However, I fear that, depending on the
versions of the specific HCFS implementations, they may also need
different versions of the HCFS client dependencies, in which case we'd
be looking to the Extension Registry and some smart classloading to
alleviate those pain points without ballooning the NiFi footprint.
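To illustrate the ancestor-NAR idea, a hypothetical pom.xml fragment (the coordinates are real, but the version is illustrative and the exact NAR-dependency mechanics would need checking for a given release):

```xml
<!-- Hypothetical fragment for an HCFS processor bundle's NAR pom.xml.
     A dependency of type "nar" makes that NAR the parent classloader,
     so the Hadoop client libraries are shared rather than rebundled. -->
<dependency>
    <groupId>org.apache.nifi</groupId>
    <artifactId>nifi-hadoop-libraries-nar</artifactId>
    <version>1.1.1</version>
    <type>nar</type>
</dependency>
```

Each HCFS NAR would then ship only its thin connector jars on top of the shared Hadoop classpath.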

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-1121
[2] https://issues.apache.org/jira/browse/NIFI-2629


On Tue, Feb 21, 2017 at 6:21 PM, Andrew Grande  wrote:
> Andre,
>
> I came across multiple NiFi use cases where going through the HDFS layer
> and the fs plugin may not be possible. I.e. when no HDFS layer present at
> all, so no NN to connect to.
>
> Another important aspect is operations. Current PutHDFS model with
> additional jar location, well, it kinda works, but I very much dislike it.
> Too many possibilities for a human error in addition to deployment pain,
> especially in a cluster.
>
> Finally, native object storage processors have features which may not even
> apply to the HDFS layer. E.g. the Azure storage has Table storage, etc.
>
> I agree consolidating various efforts is worthwhile, but only within a
> context of a specific storage solution. Not 'unifying' them into a single
> layer.
>
> Andrew
>
> On Tue, Feb 21, 2017, 6:10 PM Andre  wrote:
>
>> dev,
>>
>> I was having a chat with Pierre around PR#379 and we thought it would be
>> worth sharing this with the wider group:
>>
>>
>> I recently noticed that we merged a number of PRs around
>> scale-out/cloud-based object storage into master.
>>
>> Would it make sense to start considering adopting a pattern where
>> Put/Get/ListHDFS are used in tandem with implementations of the
>> hadoop.filesystem interfaces instead of creating new processors, except
>> where a particular deficiency/incompatibility in the hadoop.filesystem
>> implementation exists?
>>
>> Candidates for removal / non merge would be:
>>
>> - Alluxio (PR#379)
>> - WASB (PR#626)
>>  - Azure* (PR#399)
>> - *GCP (recently merged as PR#1482)
>> - *S3 (although this has been in code so it would have to be deprecated)
>>
>> The pattern would be pretty much the same as the one documented and
>> successfully deployed here:
>>
>> https://community.hortonworks.com/articles/71916/connecting-
>> to-azure-data-lake-from-a-nifi-dataflow.html
>>
>> Which means that in the case of Alluxio, one would use the properties
>> documented here:
>>
>> https://www.alluxio.com/docs/community/1.3/en/Running-
>> Hadoop-MapReduce-on-Alluxio.html
>>
>> While with Google Cloud Storage we would use the properties documented
>> here:
>>
>> https://cloud.google.com/hadoop/google-cloud-storage-connector
>>
>> I noticed that specific processors could have the ability to handle
>> properties particular to a filesystem; however, I would like to believe
>> the same issue would plague Hadoop users, and therefore it is reasonable
>> to believe the Hadoop-compatible implementations would have ways of
>> exposing those properties as well.
>>
>> In the case the properties are exposed, we could perhaps simply adjust
>> the *HDFS processors to use dynamic properties to pass them to the
>> underlying module, thereby providing a way to expose particular settings
>> of the underlying storage platform.
>>
>> Any opinion would be welcome
>>
>> PS-sent it again with proper subject label
>>


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-21 Thread Andrew Grande
Andre,

I came across multiple NiFi use cases where going through the HDFS layer
and the fs plugin may not be possible, i.e. when no HDFS layer is present
at all, so no NameNode (NN) to connect to.

Another important aspect is operations. Current PutHDFS model with
additional jar location, well, it kinda works, but I very much dislike it.
Too many possibilities for a human error in addition to deployment pain,
especially in a cluster.

Finally, native object storage processors have features which may not even
apply to the HDFS layer. E.g. the Azure storage has Table storage, etc.

I agree consolidating various efforts is worthwhile, but only within a
context of a specific storage solution. Not 'unifying' them into a single
layer.

Andrew

On Tue, Feb 21, 2017, 6:10 PM Andre  wrote:

> dev,
>
> I was having a chat with Pierre around PR#379 and we thought it would be
> worth sharing this with the wider group:
>
>
> I recently noticed that we merged a number of PRs around
> scale-out/cloud-based object storage into master.
>
> Would it make sense to start considering adopting a pattern where
> Put/Get/ListHDFS are used in tandem with implementations of the
> hadoop.filesystem interfaces instead of creating new processors, except
> where a particular deficiency/incompatibility in the hadoop.filesystem
> implementation exists?
>
> Candidates for removal / non merge would be:
>
> - Alluxio (PR#379)
> - WASB (PR#626)
>  - Azure* (PR#399)
> - *GCP (recently merged as PR#1482)
> - *S3 (although this has been in code so it would have to be deprecated)
>
> The pattern would be pretty much the same as the one documented and
> successfully deployed here:
>
> https://community.hortonworks.com/articles/71916/connecting-
> to-azure-data-lake-from-a-nifi-dataflow.html
>
> Which means that in the case of Alluxio, one would use the properties
> documented here:
>
> https://www.alluxio.com/docs/community/1.3/en/Running-
> Hadoop-MapReduce-on-Alluxio.html
>
> While with Google Cloud Storage we would use the properties documented
> here:
>
> https://cloud.google.com/hadoop/google-cloud-storage-connector
>
> I noticed that specific processors could have the ability to handle
> properties particular to a filesystem; however, I would like to believe
> the same issue would plague Hadoop users, and therefore it is reasonable
> to believe the Hadoop-compatible implementations would have ways of
> exposing those properties as well.
>
> In the case the properties are exposed, we could perhaps simply adjust
> the *HDFS processors to use dynamic properties to pass them to the
> underlying module, thereby providing a way to expose particular settings
> of the underlying storage platform.
>
> Any opinion would be welcome
>
> PS-sent it again with proper subject label
>


[DISCUSS] Scale-out/Object Storage - taming the diversity of processors

2017-02-21 Thread Andre
dev,

I was having a chat with Pierre around PR#379 and we thought it would be
worth sharing this with the wider group:


I recently noticed that we merged a number of PRs around
scale-out/cloud-based object storage into master.

Would it make sense to start considering adopting a pattern where
Put/Get/ListHDFS are used in tandem with implementations of the
hadoop.filesystem interfaces instead of creating new processors, except
where a particular deficiency/incompatibility in the hadoop.filesystem
implementation exists?

Candidates for removal / non merge would be:

- Alluxio (PR#379)
- WASB (PR#626)
- Azure* (PR#399)
- *GCP (recently merged as PR#1482)
- *S3 (although this has been in code so it would have to be deprecated)

The pattern would be pretty much the same as the one documented and
successfully deployed here:

https://community.hortonworks.com/articles/71916/connecting-to-azure-data-lake-from-a-nifi-dataflow.html

Which means that in the case of Alluxio, one would use the properties
documented here:

https://www.alluxio.com/docs/community/1.3/en/Running-Hadoop-MapReduce-on-Alluxio.html

While with Google Cloud Storage we would use the properties documented here:

https://cloud.google.com/hadoop/google-cloud-storage-connector

I noticed that specific processors could have the ability to handle
properties particular to a filesystem; however, I would like to believe the
same issue would plague Hadoop users, and therefore it is reasonable to
believe the Hadoop-compatible implementations would have ways of exposing
those properties as well.

In the case the properties are exposed, we could perhaps simply adjust the
*HDFS processors to use dynamic properties to pass them to the underlying
module, thereby providing a way to expose particular settings of the
underlying storage platform.

Any opinion would be welcome

PS-sent it again with proper subject label