Maybe an ADDINFO event or FORK event could be used and a new flowfile with the relevant attributes/content could be created. The flowfiles would be linked, but the “sensitive” information wouldn’t travel with the original.

Andy LoPresto
[email protected]
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69

> On May 14, 2018, at 3:32 PM, Mike Thomsen <[email protected]> wrote:
>
> Does the provenance system have the ability to add user-defined key/value
> pairs to a flowfile's provenance record at a particular processor?
>
> On Mon, May 14, 2018 at 6:11 PM Andy LoPresto <[email protected]> wrote:
>
>> I would actually propose that this is added to the provenance but not
>> always put into the flowfile attributes. There are many scenarios in which
>> the data retrieval should be separated from the analysis/follow-on, both
>> for visibility, responsibility, and security concerns. While I understand a
>> separate UpdateAttribute processor could be put in the downstream flow to
>> remove these attributes, I would push for not adding them by default as a
>> more secure approach. Perhaps this could be configurable on the Get*
>> processor via a boolean property, but I think doing it automatically by
>> default introduces some serious concerns.
>>
>>
>> Andy LoPresto
>> [email protected]
>> *[email protected] <[email protected]>*
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>>
>> On May 13, 2018, at 11:48 AM, Mike Thomsen <[email protected]> wrote:
>>
>> @Joe @Matt
>>
>> This is kinda related to the point that Joe made in the graph DB thread
>> about provenance. My thought here was that we need some standards on
>> enriching the metadata about what was fetched so that no matter how you
>> store the provenance, you can find some way to query it for questions like
>> when a data set was loaded into NiFi, how many records went through a
>> terminating processor, etc. IMO this could help batch-oriented
>> organizations feel more at ease with something stream-oriented like NiFi.
>>
>> On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen <[email protected]>
>> wrote:
>>
>> I'd like to propose that all non-deprecated (or likely to be deprecated)
>> Get/Fetch/Query processors get a standard convention for attributes that
>> describe things like:
>>
>> 1. Source system.
>> 2. Database/table/index/collection/etc.
>> 3. The lookup criteria that was used (similar to the "query attribute"
>> some already have).
>>
>> Using GetMongo as an example, it would add something like this:
>>
>> source.url=mongodb://localhost:27017
>> source.database=testdb
>> source.collection=test_collection
>> source.query={ "username": "john.smith" }
>> source.criteria.username=john.smith //GetMongo would parse the query and
>> add this.
>>
>> We have a use case where a team is coming from an extremely batch-oriented
>> view and really wants to know when "dataset X" was run. Our solution was to
>> extract that from the result set because the dataset name is one of the
>> fields in the JSON body.
>>
>> I think this would help expand what you can do out of the box with
>> provenance tracking because it would provide a lot of useful information
>> that could be stored in Solr or ES and then queried against terminating
>> processors' DROP events to get a solid window into when jobs were run
>> historically.
>>
>> Thoughts?
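
For illustration only, the source.* convention proposed above could be sketched roughly as follows. This is plain Python, not NiFi's Java processor API, and `source_attributes` is a hypothetical helper rather than anything that exists in NiFi or GetMongo. Flattening only top-level scalar criteria (and skipping operators such as `$or`) is an assumption about how the query might be parsed, not something the thread settles.

```python
import json


def source_attributes(url, database, collection, query):
    """Build the proposed source.* attributes for one fetch (hypothetical helper)."""
    attrs = {
        "source.url": url,
        "source.database": database,
        "source.collection": collection,
        # Keep the full query as a JSON string so nothing is lost.
        "source.query": json.dumps(query),
    }
    # Assumption: only top-level equality criteria with scalar values are
    # flattened; operators ($or, $and, ...) and nested documents are skipped.
    for key, value in query.items():
        if key.startswith("$") or isinstance(value, (dict, list)):
            continue
        attrs["source.criteria." + key] = str(value)
    return attrs


attrs = source_attributes(
    "mongodb://localhost:27017",
    "testdb",
    "test_collection",
    {"username": "john.smith"},
)
for name in sorted(attrs):
    print(name + "=" + attrs[name])
```

Whether a map like this ends up as flowfile attributes, or only in the provenance record as suggested in the replies, is a separate decision from how it is built.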
