Joe, What an awesome way to start off the week!
NiFi is a great platform and it is truly exciting to move in this direction. I do have a few questions and comments, though. Before I make them I would like to point out that they are mostly based on my recent experience selecting a last-mile agent. From memory, I think I tested:

1. Splunk ($ + JVM?)
2. Logstash (JRuby)
3. logstash-forwarder (golang)
4. Flume (Java)
5. Fluentd (Ruby)
6. Heka (golang)
7. Hindsight (C++?)

and also had a look at:

8. log-courier (golang?)
9. filebeat (golang?)

Interestingly enough, they all seem to lack centralised management... As far as I understand, of the 9, at least one - Flume 0.9 - had such capabilities but decided to steer away from them (the move from Flume to flume-ng led to the disappearance of the centralised Flume Masters). To the best of my knowledge this was motivated by the need to simplify deployment. The way I read this is that centralised management sounds like a great idea for small environments, but in larger environments it will always boil down to: how many agents do I really want running on my resource-constrained systems? Therefore, from a purely personal point of view, I would rather integrate configuration with Ansible, Puppet, whatever, than have yet another system to manage resource mapping, firewall ports and all that jazz. But maybe that's just me, as I have had past experiences mediating a mindless debate over whether McAfee software should be deployed using its own EPO agent, Altiris or SCCM.

Obviously the existence of the systems listed above does not preclude the need for MiNiFi, and nobody is obligated to use any particular agent. Furthermore, I am almost sure most of us will agree provenance should cover the last mile as well, and most of the systems above are unable to handle this area properly.

However, one thing that gives me the creeps is the code maintenance challenge. Logstash, being written in JRuby, always suffered from the JVM resource-hog stereotype, so at one stage the Logstash project decided to code logstash-forwarder, a lightweight agent written in golang. All worked fine until people started realising the agent had some issues; but as the Logstash JRuby code and user base grew, so did the importance of the JRuby bugs, and logstash-forwarder bugs started to fall through the cracks. At one point log-courier forked the code base and started yet another project to address some of these bugs, and now even elastic.co has decided to follow that path and launched filebeat to replace logstash-forwarder (source: https://goo.gl/opmaqs ). Their story teaches us two lessons: people don't seem to enjoy running a JVM where a JVM isn't strictly necessary, and maintaining code bases in two languages is challenging. So this is something that concerns me (although I talk lots and code little, I want the project to be as efficient as possible! :D )

But enough with the doom; let me share what in my opinion would be the parts of a freaking killer platform:

1. Keep the JVM away from the end point. The difference in resource consumption between my early Flume and Logstash test-beds and the Heka and Hindsight environments continues to haunt me to this day...

2. Support linked timestamping / KSI. It may just be me, but I see very little sense in providing strong provenance assurances using a system that runs within an "untrusted" environment. I mean, how could I trust a MiNiFi instance running within a CPE more than I can trust a random agent connecting to my NiFi servers? Because of this inherited lack of trust, I usually tend to see end-point provenance more as lineage than as a chain of custody, and settle for a good-enough lineage. This means I am happy to accept an agent-provided piece of information (IP address, key-value attached to a message [*]) as evidence of the generator of the information. Still, this approach has a limitation: it is hard to determine when the information was created. Linked timestamping solutions like the ones implemented by the KSI folks at Guardtime seem to be a good response to this challenge (see the hash-chain sketch just after this list).

3. Support Docker logging drivers. After all, the world outside is starting to resemble a container shipyard...

4. Support scripting that can be used to address local needs without causing bloat to the main codebase. Heka's and Hindsight's use of the Lua Sandbox are prime examples of how a good scripting engine can allow a team to cater for local needs without having to recompile code, etc. IMHO the sandbox works better than the bundles we have now (see the second sketch below).
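To make the linked-timestamping point more concrete, here is a rough sketch (purely illustrative; the class and field names are mine, not Guardtime's or NiFi's) of the basic idea: each provenance record's hash covers its content, its local timestamp, and the previous record's hash, so rewriting or back-dating any earlier record breaks every subsequent link, and the agent only needs to anchor the latest hash with a trusted service to make the whole chain tamper-evident:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Purely illustrative: a minimal hash chain over agent-side events.
public class HashChainedLog {

    private String previousHash = "genesis";

    public String append(String event, long timestampMillis) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        // The hash covers the previous hash, the timestamp, and the content,
        // so no earlier record can be altered without breaking this link.
        sha256.update(previousHash.getBytes(StandardCharsets.UTF_8));
        sha256.update(Long.toString(timestampMillis).getBytes(StandardCharsets.UTF_8));
        sha256.update(event.getBytes(StandardCharsets.UTF_8));
        previousHash = Base64.getEncoder().encodeToString(sha256.digest());
        return previousHash; // the value you would periodically anchor (e.g. via KSI)
    }

    public static void main(String[] args) throws Exception {
        HashChainedLog log = new HashChainedLog();
        System.out.println(log.append("sensor-42 reading=17.3", System.currentTimeMillis()));
        System.out.println(log.append("sensor-42 reading=17.9", System.currentTimeMillis()));
    }
}

Real KSI does far more than this toy chain (aggregation trees, calendar anchoring, signatures), but even this much shows why a trusted time anchor answers "when was this created?" better than a bare agent-supplied timestamp.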
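And to make the scripting point concrete: the JVM already ships with a JSR-223 scripting API, so even a Java-based agent could let operators drop in a small filter script instead of rebuilding a bundle. A minimal sketch, assuming the Nashorn engine is available (as on Java 8); Heka and Hindsight do the equivalent, much more safely, with sandboxed Lua:

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Purely illustrative: the agent loads an operator-supplied script once,
// then calls it for every event, so local filtering logic can change
// without recompiling or redeploying the agent itself.
public class ScriptedFilter {

    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");

        // In real life this would be read from a file pushed by ops and run
        // inside a sandbox with CPU/memory limits (as the Lua Sandbox does),
        // not with full JVM privileges.
        engine.eval("function accept(event) { return event.indexOf('ERROR') !== -1; }");

        Invocable invocable = (Invocable) engine;
        String[] events = { "INFO all good", "ERROR disk full" };
        for (String event : events) {
            Object keep = invocable.invokeFunction("accept", event);
            if (Boolean.TRUE.equals(keep)) {
                System.out.println("forwarding: " + event);
            }
        }
    }
}

The design choice worth copying from Heka/Hindsight is not the language but the sandbox: the script gets a narrow API and hard resource limits, which is what keeps "local needs" from turning into agent bloat.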
Once again, thank you for the news. Looking forward to reading more about MiNiFi in the coming months.

Cheers,
Andre

[*] the lumberjack processor is already doing this. I hope to have an early stable version of the code ready sometime this week (apologies for the delay, but I indulged myself with a break this year! 8D ).

On Sun, Jan 10, 2016 at 11:29 AM, Joe Witt <[email protected]> wrote:
> NiFi Community,
>
> I'd like to initiate discussion around a proposal to create our first
> sub-project of NiFi. A possible name for it is "MiNiFi" a sort of
> play on Mini-NiFi.
>
> The idea is to provide a complementary data collection agent to NiFi's
> current approach of dataflow management. As noted in our ASF TLP
> resolution NiFi is to provide "an automated and durable data broker
> between systems providing interactive command and control and detailed
> chain of custody for data." MiNiFi would be consistent with that
> scope with a specific focus on the first-mile challenge so common in
> dataflow.
>
> Specific goals of MiNiFi would be to provide a small, lightweight,
> centrally managed agent that natively generates data provenance and
> seamlessly integrates with NiFi for follow-on dataflow management and
> maintenance of the chain of custody provided by the powerful data
> provenance features of NiFi.
>
> MiNiFi should be designed to operate directly on or adjacent to the
> source sensor, system, server generating the events as a resource
> sensitive tenant. There are numerous agent models in existence today
> but they do not offer the command and control or provenance that is so
> important to the philosophy and scope of NiFi.
>
> These agents would necessarily have a different interactive command
> and control model than NiFi as you'd not expect consistent behavior,
> capability, or accessibility of all instances of the agents at any
> given time.
>
> Multiple implementations of MiNiFi are envisioned including those that
> operate on the JVM and those that do not.
>
> As the discussion advances we can put together wiki pages, concept
> diagrams, and requirements to help better articulate how this might
> evolve. We should also discuss the mechanics of how this might work
> in terms of infrastructure, code repository, and more.
>
> Thanks
> Joe
>
