Joe, What an awesome way to start off the week!
NiFi is a great platform and it is truly exciting to move in this direction. I do have a few questions and comments, though. Before I make them I would like to point out that they are mostly based on my recent experience selecting a last-mile agent. From memory, I think I tested:

1. Splunk ($ + JVM?)
2. Logstash (JRuby)
3. logstash-forwarder (golang)
4. Flume (Java)
5. Fluentd (Ruby)
6. Heka (golang)
7. Hindsight (C++?)

and also had a look at:

8. log-courier (golang?)
9. filebeat (golang?)

Interestingly enough, they all seem to lack centralised management... As far as I understand, of the 9, at least one - Flume 0.9 - had such capabilities but decided to steer away from them (the move from Flume to flume-ng led to the disappearance of the centralised Flume Masters). To the best of my knowledge this was motivated by the need to simplify deployment. The way I read this is that centralised management sounds like a great idea for small environments, but in larger environments it will always boil down to: how many agents do I really want running on my resource-constrained systems? Therefore, from a purely personal point of view, I would rather integrate configuration with Ansible, Puppet, whatever, than have yet another system to manage resource mapping, firewall ports and all that jazz. But maybe that's just me, as I have had past experiences mediating a mindless debate over whether McAfee software should be deployed using its own EPO agent, Altiris or SCCM.

Obviously the existence of the systems listed above does not preclude the need for MiNiFi, and nobody is obligated to use any particular agent. Furthermore, I am almost sure most of us will agree provenance should cover the last mile as well, and most of the systems above are unable to handle this area properly.

However, one thing that gives me the creeps is the code maintenance challenge. Logstash, being written in JRuby, always suffered from the JVM resource-hog stereotype, so at one stage the Logstash project decided to code logstash-forwarder, a lightweight agent written in golang. All worked fine until people started realising the agent had some issues; but as the Logstash JRuby code and user base grew, so did the importance of the JRuby bugs, and logstash-forwarder bugs started to fall through the cracks. At one point log-courier forked the code base and started yet another project to address some of these bugs, and now even elastic.co has decided to follow that path and launched filebeat to replace logstash-forwarder (source: https://goo.gl/opmaqs ). Their story teaches us two lessons: people don't seem to enjoy running a JVM where a JVM isn't strictly necessary, and maintaining code bases in two languages is challenging. So this is something that concerns me (although I talk lots and code little, I want the project to be as efficient as possible! :D )

But enough with the doom; let me share what in my opinion would be the parts of a freaking killer platform:

1. Keep the JVM away from the end point. The difference in resource consumption between my early Flume and Logstash test-beds and the Heka and Hindsight environments continues to haunt me to this day...

2. Support linked timestamping / KSI. It may just be me, but I see very little sense in providing strong provenance assurances using a system that runs within an "untrusted" environment. I mean, how could I trust a MiNiFi instance running within a CPE more than I can trust a random agent connecting to my NiFi servers? Because of this inherited lack of trust, I usually tend to see end-point provenance more as lineage than as a chain of custody, and settle for a good-enough lineage. This means I am happy to accept an agent-provided piece of information (IP address, key-value attached to a message [*]) as evidence of the generator of the information. Still, this approach has a limitation: it is hard to determine when the information was created. Linked timestamping solutions like the ones implemented by the KSI folks at Guardtime seem to be a good response to this challenge (see the hash-chain sketch just after this list).

3. Support Docker logging drivers. After all, the world outside is starting to resemble a container shipyard...

4. Support scripting that can be used to address local needs without causing bloat to the main codebase. Heka's and Hindsight's use of the Lua Sandbox are prime examples of how a good scripting engine can allow a team to cater for local needs without having to recompile code, etc. IMHO the sandbox works better than the bundles we have now (see the second sketch below).
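To make the linked-timestamping point more concrete, here is a rough sketch (purely illustrative; the class and field names are mine, not Guardtime's or NiFi's) of the basic idea: each provenance record's hash covers its content, its local timestamp, and the previous record's hash, so rewriting or back-dating any earlier record breaks every subsequent link, and the agent only needs to anchor the latest hash with a trusted service to make the whole chain tamper-evident:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Purely illustrative: a minimal hash chain over agent-side events.
public class HashChainedLog {

    private String previousHash = "genesis";

    public String append(String event, long timestampMillis) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        // The hash covers the previous hash, the timestamp, and the content,
        // so no earlier record can be altered without breaking this link.
        sha256.update(previousHash.getBytes(StandardCharsets.UTF_8));
        sha256.update(Long.toString(timestampMillis).getBytes(StandardCharsets.UTF_8));
        sha256.update(event.getBytes(StandardCharsets.UTF_8));
        previousHash = Base64.getEncoder().encodeToString(sha256.digest());
        return previousHash; // the value you would periodically anchor (e.g. via KSI)
    }

    public static void main(String[] args) throws Exception {
        HashChainedLog log = new HashChainedLog();
        System.out.println(log.append("sensor-42 reading=17.3", System.currentTimeMillis()));
        System.out.println(log.append("sensor-42 reading=17.9", System.currentTimeMillis()));
    }
}

Real KSI does far more than this toy chain (aggregation trees, calendar anchoring, signatures), but even this much shows why a trusted time anchor answers "when was this created?" better than a bare agent-supplied timestamp.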
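And to make the scripting point concrete: the JVM already ships with a JSR-223 scripting API, so even a Java-based agent could let operators drop in a small filter script instead of rebuilding a bundle. A minimal sketch, assuming the Nashorn engine is available (as on Java 8); Heka and Hindsight do the equivalent, much more safely, with sandboxed Lua:

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Purely illustrative: the agent loads an operator-supplied script once,
// then calls it for every event, so local filtering logic can change
// without recompiling or redeploying the agent itself.
public class ScriptedFilter {

    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");

        // In real life this would be read from a file pushed by ops and run
        // inside a sandbox with CPU/memory limits (as the Lua Sandbox does),
        // not with full JVM privileges.
        engine.eval("function accept(event) { return event.indexOf('ERROR') !== -1; }");

        Invocable invocable = (Invocable) engine;
        String[] events = { "INFO all good", "ERROR disk full" };
        for (String event : events) {
            Object keep = invocable.invokeFunction("accept", event);
            if (Boolean.TRUE.equals(keep)) {
                System.out.println("forwarding: " + event);
            }
        }
    }
}

The design choice worth copying from Heka/Hindsight is not the language but the sandbox: the script gets a narrow API and hard resource limits, which is what keeps "local needs" from turning into agent bloat.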
Once again, thank you for the news. Looking forward to reading more about MiNiFi in the coming months.

Cheers,
Andre

[*] the lumberjack processor is already doing this. I hope to have an early stable version of the code ready sometime this week (apologies for the delay, but I indulged myself with a break this year! 8D ).

On Sun, Jan 10, 2016 at 11:29 AM, Joe Witt <[email protected]> wrote:
> NiFi Community,
>
> I'd like to initiate discussion around a proposal to create our first
> sub-project of NiFi. A possible name for it is "MiNiFi" a sort of
> play on Mini-NiFi.
>
> The idea is to provide a complementary data collection agent to NiFi's
> current approach of dataflow management. As noted in our ASF TLP
> resolution NiFi is to provide "an automated and durable data broker
> between systems providing interactive command and control and detailed
> chain of custody for data." MiNiFi would be consistent with that
> scope with a specific focus on the first-mile challenge so common in
> dataflow.
>
> Specific goals of MiNiFi would be to provide a small, lightweight,
> centrally managed agent that natively generates data provenance and
> seamlessly integrates with NiFi for follow-on dataflow management and
> maintenance of the chain of custody provided by the powerful data
> provenance features of NiFi.
>
> MiNiFi should be designed to operate directly on or adjacent to the
> source sensor, system, server generating the events as a resource
> sensitive tenant. There are numerous agent models in existence today
> but they do not offer the command and control or provenance that is so
> important to the philosophy and scope of NiFi.
>
> These agents would necessarily have a different interactive command
> and control model than NiFi as you'd not expect consistent behavior,
> capability, or accessibility of all instances of the agents at any
> given time.
>
> Multiple implementations of MiNiFi are envisioned including those that
> operate on the JVM and those that do not.
>
> As the discussion advances we can put together wiki pages, concept
> diagrams, and requirements to help better articulate how this might
> evolve. We should also discuss the mechanics of how this might work
> in terms of infrastructure, code repository, and more.
>
> Thanks
> Joe
>
