Andre,

Wow! Lots of good points/considerations in there. Let me try to respond
to a few; these are all things we should be sure to capture in a more
formal proposal document.

Centralized Management: Yeah, this would not limit or compete with the
automated deployment tools you mentioned. What it is envisioned to do,
however, is go beyond what those provide and address
application/domain-specific needs, like altering the running behavior of
a dataflow. Let those deployment management tools deploy, upgrade, and
tune system-specific aspects, but let a well-defined MiNiFi agent
interface describe how runtime information may be exchanged to alter the
dataflows.
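To sketch the kind of thing I mean (purely illustrative; none of these
names exist anywhere yet and a real design needs much more thought):

    import java.util.Map;

    // Purely illustrative sketch of the command-and-control surface a
    // MiNiFi agent could expose; all names here are placeholders.
    public interface AgentCommandControl {

        // Report health, current flow version, and resource usage back
        // to the central NiFi instance whenever connectivity allows.
        Map<String, String> heartbeat();

        // Accept an updated flow definition pushed from central
        // management and apply it without restarting the host process.
        void applyFlowUpdate(String flowDefinition);

        // Alter runtime behavior (e.g., back-pressure thresholds or
        // batch sizes) without replacing the whole flow.
        void tuneRuntime(Map<String, String> properties);
    }

The deployment tools would still own installing and upgrading the binary;
a surface like this is only about the dataflow itself.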
Provenance/Trust: Great point about the value of provenance generated in
untrusted environments and the potential for lighter-weight alternatives.
I think, though, that if we're concerned about the integrity and
non-repudiation of provenance we should tackle that as a holistic
concern. With the evolution of data centers, public/private clouds, and
more, the whole idea of trusted zones is changing. There are interesting
'privacy by design' sites/papers we should look into for how to provide
integrity and non-repudiation for provenance end-to-end.
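To make that concrete: the heart of the linked-timestamping idea you
raise below is just a hash chain over the events. A minimal sketch of the
concept only (this is not a KSI client, and the class/method names are
made up):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Strawman of hash-linked provenance: each event's hash covers the
    // previous hash plus the event and its timestamp, so altering or
    // back-dating any one event breaks every hash after it.
    public class ProvenanceChain {

        private String previousHash = "GENESIS";

        public String append(String event, long timestampMillis)
                throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(previousHash.getBytes(StandardCharsets.UTF_8));
            digest.update(event.getBytes(StandardCharsets.UTF_8));
            digest.update(Long.toString(timestampMillis)
                    .getBytes(StandardCharsets.UTF_8));

            // Hex-encode the digest and make it the new chain head.
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            previousHash = hex.toString();
            return previousHash;
        }
    }

Periodically anchoring the head of such a chain with a trusted party,
which as I understand it is roughly what the Guardtime folks do, would
let a central NiFi detect tampering or back-dating even from agents it
does not fully trust.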
The weight of the JVM / Code Maintenance: I agree that we must be mindful
of code maintenance. A JVM implementation will be useful as a
relay/gateway or on devices where it can be more the star of the show.
For other systems with existing processes, where this must be a sort of
silent partner, I think a native implementation will be the winner. There
are lots of good examples of existing agents out there that, as you
mentioned, we should study for tradeoffs/ideas.

I've only responded to a small portion of the great points and
considerations you bring up. Look forward to discussing and forming this
further.

Thanks
Joe

On Sun, Jan 10, 2016 at 7:18 AM, Andre <[email protected]> wrote:
> Joe,
>
> What an awesome way to start the week!
>
> NiFi is a great platform and it is truly exciting to move in this
> direction. However, I have a few questions / comments.
>
> Before I make these comments I would like to point out that they are
> mostly based on my recent experience selecting a last-mile agent. From
> memory I think I tested:
>
> 1. splunk ($ + JVM?),
> 2. logstash (jruby),
> 3. logstash-forwarder (golang),
> 4. flume (java),
> 5. fluentd (ruby),
> 6. heka (golang),
> 7. hindsight (c++?).
>
> and also had a look at:
>
> 8. log-courier (golang?), and
> 9. filebeat (golang?).
>
> Interestingly enough, they all seem to lack centralised management...
>
> As far as I understand, of the 9, at least one - flume 0.9 - had such
> capabilities but decided to steer away from them (the move from Flume
> to flume-ng led to the disappearance of centralised Flume Masters).
>
> To the best of my knowledge this was motivated by the need to simplify
> deployment.
>
> The way I read this is that centralised management sounds like a great
> idea for small environments, but in larger environments it will always
> boil down to: how many agents do I really want running on my
> resource-constrained systems?
>
> Therefore, from a purely personal point of view, I would rather
> integrate configuration with ansible, puppet, whatever, than have yet
> another system to manage resource mapping, firewall ports and all that
> jazz. But maybe that's just me, as I have had past experiences having
> to mediate a mindless debate about whether McAfee software should be
> deployed using its own EPO agent, Altiris or SCCM.
>
> Obviously the existence of the systems listed above does not preclude
> the need for MiNiFi, and nobody is obligated to use any particular
> agent.
>
> Furthermore, I am almost sure most of us will agree provenance should
> cover the last mile as well, and most of the systems above are unable
> to handle this area properly.
>
> However, one thing that gives me the creeps is the code maintenance
> challenge:
>
> Logstash, being written in jruby, always suffered from the JVM
> resource-hog stereotype, so at one stage the Logstash project decided
> to code logstash-forwarder, a lightweight agent written in golang.
>
> All worked fine until people started realising the agent had some
> issues, but as the logstash jruby code and user base grew, so did the
> importance of the jruby bugs. logstash-forwarder bugs started to fall
> through the cracks.
>
> At one point log-courier forked the code base and started yet another
> project to address some of these bugs. Now even elastic.co has decided
> to follow that path and launched filebeat to replace
> logstash-forwarder (source: https://goo.gl/opmaqs ).
>
> Their story tells us two lessons: people don't seem to enjoy running a
> JVM where a JVM isn't strictly necessary, and maintaining code bases in
> two languages is challenging. So it is something that concerns me
> (although I talk lots and code little, I want the project to be as
> efficient as possible! :D )
>
> But enough with the doom; let me share what in my opinion would be
> parts of a freaking killer platform:
>
> 1. Keep the JVM away from the end point.
>
> The difference in resource consumption between my early flume and
> logstash test-beds and the heka and hindsight environments continues to
> haunt me to this day...
>
> 2. Support Linked Time-stamping / KSI.
>
> It may just be me, but I find very little sense in providing strong
> provenance assurances using a system that runs within an "untrusted"
> environment. I mean, how could I trust a MiNiFi instance running within
> a CPE more than I can trust a random agent connecting to my NiFi
> servers?
>
> Because of this inherent lack of trust, I usually tend to see end-point
> provenance more as lineage than a chain of custody, and settle for a
> good-enough lineage. This means I am happy to accept an agent-provided
> piece of information (IP address, key-value attached to a message [*])
> as evidence of the generator of the information.
>
> Still, this approach has a limitation: it is hard to determine when the
> information was created.
>
> Linked timestamp solutions like the ones implemented by the KSI folks
> at Guardtime seem to be a good response to this challenge.
>
> 3. Support docker logging drivers.
>
> After all, the world outside is starting to resemble a container
> shipyard...
>
> 4. Support scripting that can be used to address local needs without
> causing bloat to the main codebase.
>
> Heka and Hindsight's use of the Lua Sandbox is a prime example of how a
> good scripting engine can allow teams to cater for local needs without
> having to recompile code, etc. IMHO the sandbox works better than the
> bundles we have now.
>
> Once again, thank you for the news. Looking forward to reading more
> about MiNiFi in the coming months.
>
> Cheers,
>
> Andre
>
> [*] the lumberjack processor is already doing this. I hope to have an
> early stable version of the code ready sometime this week (apologies
> for the delay, but I indulged myself with a break this year! 8D ).
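On your point 4: for a JVM-based implementation, the JSR-223
javax.script machinery that already ships with the JDK could play a role
similar to Heka/Hindsight's Lua sandbox. A rough sketch (the engine
choice and bindings are illustrative only, and real sandboxing such as
class filtering and time/memory limits would have to be layered on top):

    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class ScriptedTransform {

        public static void main(String[] args) throws Exception {
            // Nashorn ships with Java 8; the engine choice here is
            // purely illustrative.
            ScriptEngine engine =
                    new ScriptEngineManager().getEngineByName("nashorn");

            // An operator-supplied snippet that rewrites an event
            // without any recompilation of the agent itself.
            String userScript = "event.toUpperCase()";

            engine.put("event", "sensor reading 42");
            System.out.println(engine.eval(userScript));
            // prints: SENSOR READING 42
        }
    }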
> On Sun, Jan 10, 2016 at 11:29 AM, Joe Witt <[email protected]> wrote:
>> NiFi Community,
>>
>> I'd like to initiate discussion around a proposal to create our first
>> sub-project of NiFi. A possible name for it is "MiNiFi", a sort of
>> play on Mini-NiFi.
>>
>> The idea is to provide a complementary data collection agent to NiFi's
>> current approach of dataflow management. As noted in our ASF TLP
>> resolution, NiFi is to provide "an automated and durable data broker
>> between systems providing interactive command and control and detailed
>> chain of custody for data." MiNiFi would be consistent with that
>> scope with a specific focus on the first-mile challenge so common in
>> dataflow.
>>
>> Specific goals of MiNiFi would be to provide a small, lightweight,
>> centrally managed agent that natively generates data provenance and
>> seamlessly integrates with NiFi for follow-on dataflow management and
>> maintenance of the chain of custody provided by the powerful data
>> provenance features of NiFi.
>>
>> MiNiFi should be designed to operate directly on or adjacent to the
>> source sensor, system, or server generating the events, as a
>> resource-sensitive tenant. There are numerous agent models in
>> existence today but they do not offer the command and control or
>> provenance that is so important to the philosophy and scope of NiFi.
>>
>> These agents would necessarily have a different interactive command
>> and control model than NiFi, as you'd not expect consistent behavior,
>> capability, or accessibility of all instances of the agents at any
>> given time.
>>
>> Multiple implementations of MiNiFi are envisioned, including those
>> that operate on the JVM and those that do not.
>>
>> As the discussion advances we can put together wiki pages, concept
>> diagrams, and requirements to help better articulate how this might
>> evolve. We should also discuss the mechanics of how this might work
>> in terms of infrastructure, code repository, and more.
>>
>> Thanks
>> Joe
