"I look at MiNiFi C++ as a direct spoke of a NiFi hub and as such it really can be treated as one "NiFi" instance."
[joe]Yes, but there can be other hubs too, and in parallel. For example, it is quite common for an edge collection location to write events to a local message bus for local usage while at the same time sending the feed to a central NiFi instance. We should avoid introducing a single exfil point limitation, especially when the primary reason would be to simplify the concept of provenance. The whole point of provenance is to capture and embrace what really happens in end-to-end flows.

"Additionally, since MiNiFi C++ is a complete rewrite, as has been previously discussed, making requirement variations from NiFi or MiNiFi Java is acceptable, in my opinion."

[joe]You've mentioned this a couple of times now. I don't think anyone is making a case here on the basis that we don't want to change it because we want to avoid requirement variations. The discussion is purely on the merit of the ideas. We should always be open to requirement changes.

"As such, there is no value in having separate provenance for MiNiFi C++ and NiFi since it is one cradle to grave path (that happens to use both)."

[joe]I'm not quite sure I understand, so please elaborate if my comments don't apply. There is no such thing as 'separate provenance' really. The bottom line is that capturing facts about what happens to a piece of data at a point in its lifecycle happens all over the end-to-end chain. These things, when wired together, conceptually form a representation of the graph of how data flowed. Ultimately a single instance of MiNiFi only knows about the events that happened on its watch. The same is true for an instance of NiFi. In the end, there are various places where provenance gets generated, and then you get to the scenario of "how do I see the end-to-end chain". This requires something even beyond any NiFi itself. Apache Atlas (incubating) might be an answer but there may be others. This is why we have facilities like reporting tasks to send provenance events to some place.
This is often just sending to HDFS so all events are in one place for retention and analysis. The concept of provenance is bigger than NiFi or MiNiFi, to be clear. And, at this point we do not have any plans or designs for having a NiFi cluster take ownership of other systems' provenance events (even if those other systems are NiFi or MiNiFi agents). We can certainly act as a relay point for such information, but to index them and properly represent them in the context of who owns them is another matter.

Frankly, if you get into the deep weeds of provenance you can get into some fun discussions about data identity. When I am systemX and have object Y and send it to systemZ, did I send object Y or did I send some object Y2? If you think I sent Y, then what happens to Y if it is altered on the other system? We can't now both be talking about Y but talking about different versions. Etc.

"I personally don't see this as an attribute as currently represented in the flowfiles since that would not be an efficient structure to handle or maintain through MiNiFi C++ pathing. This requires the provenance tree related to that flowfile to be sent (which should be small-ish in a MiNiFi C++ instance). My design for it was that it would be a separate data point on the flowfile package using a simple, extremely lightweight, and easy to manipulate structure. Truthfully, it doesn't even have to be resident all through the MiNiFi C++ flow if a viable repo replaces LevelDB and my preference is to add it in at the S2S processor. The important thing is that it can be sent with the flowfile through S2S and then added to the main NiFi provenance repo so as to provide a continuous chain. This would be easy to toggle through a single checkbox added to a MiNiFi C++ S2S variant so that if you choose not to integrate as provenance isn't important to you, you could."
[joe]Ok, so I think what you're saying is that you'd have a sort of hybrid out-of-band model where it is brought in-band during site-to-site transfers. I see how that helps, and that is certainly fine as a transport. I'm not sure how expensive it would be to collect the provenance trail during transfer, but of course the provenance repository for MiNiFi could be optimized for that. Also, we still have to consider that MiNiFi isn't limited to just being tethered to a single NiFi instance, so we'd need to be clear that there could be additional provenance we're not getting via this path, and if it came in via other paths we'd have to have a way to resolve this.

"Since in this model, MiNiFi C++ plus provenance only integrates with NiFi hubs, there is no reason to concern with outside compatibility for this specific S2S processor mechanism."

[joe]It is really important to propose and advocate a model for provenance that honors the existing plan and model for MiNiFi and NiFi. Or, if we should discuss altering that model, we should do that on a separate thread, and we should also have good reasons to limit it from what is planned today, ideally for more reasons than just making provenance more clear. It was definitely built with the understanding of edge use cases requiring more than a single exfil path.

Thanks
Joe

On Tue, Nov 29, 2016 at 8:02 AM, Daniel Cave <[email protected]> wrote:
> As to Joe and Aldrin's concerns, I feel a bit more detail of what I had in
> mind might clear up some of the concerns and vagaries (all valid) that you
> mentioned.
>
> As Aldrin mentioned, to me provenance is not about metadata needed for
> routing. I don't doubt there are use cases for that, as Randy mentioned,
> however it was not the concern I had in mind that I am looking to address
> with this discussion. If the community wants to add more functionality from
> a metadata also, we can certainly add that.
>
> As for Joe's examples and concerns for in-band, I look at MiNiFi C++ as a
> direct spoke of a NiFi hub and as such it really can be treated as one
> "NiFi" instance. Additionally, since MiNiFi C++ is a complete rewrite, as
> has been previously discussed, making requirement variations from NiFi or
> MiNiFi Java is acceptable, in my opinion. As such, there is no value in
> having separate provenance for MiNiFi C++ and NiFi since it is one cradle to
> grave path (that happens to use both). As for bandwidth concerns, this is
> actually exactly one of the issues that concerns me as later calling to the
> MiNiFi C++ enabled device merely to sort and retrieve provenance (which
> would be a heavy operation as currently constructed) is not realistic. One
> of the biggest selling points of NiFi is its full data provenance ability,
> and my goal is merely to extend it through the full "flow". I personally
> don't see this as an attribute as currently represented in the flowfiles
> since that would not be an efficient structure to handle or maintain through
> MiNiFi C++ pathing. This requires the provenance tree related to that
> flowfile to be sent (which should be small-ish in a MiNiFi C++ instance).
> My design for it was that it would be a separate data point on the flowfile
> package using a simple, extremely lightweight, and easy to manipulate
> structure. Truthfully, it doesn't even have to be resident all through the
> MiNiFi C++ flow if a viable repo replaces LevelDB and my preference is to
> add it in at the S2S processor. The important thing is that it can be sent
> with the flowfile through S2S and then added to the main NiFi provenance
> repo so as to provide a continuous chain. This would be easy to toggle
> through a single checkbox added to a MiNiFi C++ S2S variant so that if you
> choose not to integrate as provenance isn't important to you, you could.
> Since in this model, MiNiFi C++ plus provenance only integrates with NiFi
> hubs, there is no reason to concern with outside compatibility for this
> specific S2S processor mechanism.
>
> I see the ability to allow for "in-band" communication at the S2S-S2S point
> as a requirement for some use cases.
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14045.html
> Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.
