Hi Mike,

I agree with the approach of enriching provenance events. To do so, there are
several places where we can embed meta-data:
- FlowFile attributes: automatically mapped to a provenance event, but as Andy
  mentioned, we need to be careful not to put sensitive data there.
- Transit URI: when I developed the NiFi Atlas integration, I used this as the
  primary source of what data a processor interacts with, e.g. remote address,
  database, table, etc.
- The 'details' string: it might not be the ideal solution, but
  ProvenanceReporter accepts an additional 'details' string. We can embed
  whatever we want here.

I'd map the meta-data you mentioned as follows:

1. Source system. => Transit URI
2. Database/table/index/collection/etc. => Transit URI or FlowFile attribute.
   I think it's fine to put these into attributes.
3. The lookup criteria that was used (similar to the "query attribute" some
   already have). => 'details' string

What I learned from the Atlas integration is that it's really hard to design a
complete standard set of attributes. I'd suggest using what the NiFi framework
currently provides.

Thanks,
Koji

On Tue, May 15, 2018 at 8:15 AM, Andy LoPresto <[email protected]> wrote:
> Maybe an ADDINFO event or FORK event could be used and a new flowfile with
> the relevant attributes/content could be created. The flowfiles would be
> linked, but the "sensitive" information wouldn't travel with the original.
>
> Andy LoPresto
> [email protected]
> [email protected]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>
> On May 14, 2018, at 3:32 PM, Mike Thomsen <[email protected]> wrote:
>
> Does the provenance system have the ability to add user-defined key/value
> pairs to a flowfile's provenance record at a particular processor?
>
> On Mon, May 14, 2018 at 6:11 PM Andy LoPresto <[email protected]> wrote:
>
> I would actually propose that this is added to the provenance but not always
> put into the flowfile attributes. There are many scenarios in which the data
> retrieval should be separated from the analysis/follow-on, for visibility,
> responsibility, and security reasons. While I understand a separate
> UpdateAttribute processor could be put in the downstream flow to remove
> these attributes, I would push for not adding them by default as a more
> secure approach. Perhaps this could be configurable on the Get* processor
> via a boolean property, but I think doing it automatically by default
> introduces some serious concerns.
>
> Andy LoPresto
> [email protected]
> [email protected]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>
> On May 13, 2018, at 11:48 AM, Mike Thomsen <[email protected]> wrote:
>
> @Joe @Matt
>
> This is kinda related to the point that Joe made in the graph DB thread
> about provenance. My thought here was that we need some standards for
> enriching the metadata about what was fetched, so that no matter how you
> store the provenance, you can find some way to query it for questions like
> when a data set was loaded into NiFi, how many records went through a
> terminating processor, etc. IMO this could help batch-oriented organizations
> feel more at ease with something stream-oriented like NiFi.
>
> On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen <[email protected]> wrote:
>
> I'd like to propose that all non-deprecated (or likely to be deprecated)
> Get/Fetch/Query processors get a standard convention for attributes that
> describe things like:
>
> 1. Source system.
> 2. Database/table/index/collection/etc.
> 3. The lookup criteria that was used (similar to the "query attribute" some
>    already have).
>
> Using GetMongo as an example, it would add something like this:
>
> source.url=mongodb://localhost:27017
> source.database=testdb
> source.collection=test_collection
> source.query={ "username": "john.smith" }
> source.criteria.username=john.smith  // GetMongo would parse the query and add this.
>
> We have a use case where a team is coming from an extremely batch-oriented
> view and really wants to know when "dataset X" was run. Our solution was to
> extract that from the result set, because the dataset name is one of the
> fields in the JSON body.
>
> I think this would help expand what you can do out of the box with
> provenance tracking, because it would provide a lot of useful information
> that could be stored in Solr or ES and then queried against terminating
> processors' DROP events to get a solid window into when jobs were run
> historically.
>
> Thoughts?
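
To make Koji's mapping concrete, here is a minimal, hypothetical sketch of a
Get*-style processor in Java. The class name, the "source.*" attribute keys,
and the hard-coded Mongo values are illustrative assumptions taken from Mike's
GetMongo example, not an agreed convention; the sketch only shows where the
FlowFile attributes, the transit URI, and the 'details' string would be set,
and the exact ProvenanceReporter overload should be checked against the API.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    // Sketch only: a skeletal Get*-style processor showing where the metadata
    // discussed in this thread could be surfaced.
    public class GetExampleRecord extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("Fetched records")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
            // Hard-coded for illustration; a real processor would read these from properties.
            final String uri = "mongodb://localhost:27017";
            final String database = "testdb";
            final String collection = "test_collection";
            final String query = "{ \"username\": \"john.smith\" }";

            FlowFile flowFile = session.create();
            // ... write the fetched result into the FlowFile content here ...

            // 2. Database/collection -> FlowFile attributes. These are copied onto
            //    the provenance event automatically; keep sensitive values out of them.
            final Map<String, String> attrs = new HashMap<>();
            attrs.put("source.url", uri);
            attrs.put("source.database", database);
            attrs.put("source.collection", collection);
            flowFile = session.putAllAttributes(flowFile, attrs);

            // 1. Source system -> transit URI; 3. lookup criteria -> the optional
            //    'details' string that ProvenanceReporter accepts.
            session.getProvenanceReporter().receive(flowFile,
                    uri + "/" + database + "/" + collection,
                    "query=" + query);

            session.transfer(flowFile, REL_SUCCESS);
        }
    }

An UpdateAttribute processor downstream, or the boolean property Andy suggests,
could still strip or suppress the source.* attributes in flows where they are
considered sensitive.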

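Andy's ADDINFO/FORK suggestion could look roughly like the sketch below: the
lookup metadata is put on a child FlowFile created from the original, so the
two stay linked in provenance while the "sensitive" attributes do not travel
with the original. The helper name and attribute map are hypothetical, and
whether the explicit fork() call is needed in addition to create(parent)
depends on how the framework derives lineage.

    import java.util.Collections;
    import java.util.Map;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;

    // Rough sketch of the FORK idea: lookup metadata rides on a linked child
    // FlowFile instead of the original.
    public class ForkMetadataSketch {

        static FlowFile emitMetadataChild(final ProcessSession session, final FlowFile original,
                                          final Map<String, String> lookupMetadata) {
            // create(parent) ties the child to the original FlowFile's lineage.
            FlowFile child = session.create(original);
            child = session.putAllAttributes(child, lookupMetadata);

            // Record the FORK so the parent/child link shows up in provenance
            // (the framework may also derive this from the parent/child lineage).
            session.getProvenanceReporter().fork(original, Collections.singleton(child));

            // The caller transfers both FlowFiles; the original keeps no sensitive attributes.
            return child;
        }
    }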