Re: Proposal: standard record metadata attributes for data sources

Andy LoPresto Mon, 14 May 2018 16:16:09 -0700

Maybe an ADDINFO event or FORK event could be used and a new flowfile with the 
relevant attributes/content could be created. The flowfiles would be linked, 
but the “sensitive” information wouldn’t travel with the original.
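
As a rough sketch of that split, using plain Python dictionaries in place of flowfile attributes (this is not the NiFi API; the `source.` prefix, the `parent.uuid` key, and the `fork_sensitive` helper are all assumptions made up for illustration):

```python
import uuid


def fork_sensitive(attributes, sensitive_prefix="source."):
    """Split one attribute map into (parent, child).

    The parent keeps everything except keys starting with
    sensitive_prefix; the child carries only those keys plus a
    link back to the parent.
    """
    parent = {k: v for k, v in attributes.items()
              if not k.startswith(sensitive_prefix)}
    child = {k: v for k, v in attributes.items()
             if k.startswith(sensitive_prefix)}
    # Hypothetical linkage: the child records the parent's identifier
    # and gets a fresh identifier of its own.
    child["parent.uuid"] = attributes["uuid"]
    child["uuid"] = str(uuid.uuid4())
    return parent, child


original = {
    "uuid": "abc-123",
    "filename": "result.json",
    "source.url": "mongodb://localhost:27017",
}
parent, child = fork_sensitive(original)
```

Here `parent` would continue down the original flow without any `source.*` keys, while `child` could be routed somewhere with tighter access.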


Andy LoPresto
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On May 14, 2018, at 3:32 PM, Mike Thomsen <[email protected]> wrote:
> 
> Does the provenance system have the ability to add user-defined key/value
> pairs to a flowfile's provenance record at a particular processor?
> 
> On Mon, May 14, 2018 at 6:11 PM Andy LoPresto <[email protected]> wrote:
> 
>> I would actually propose that this is added to the provenance but not
>> always put into the flowfile attributes. There are many scenarios in which
>> the data retrieval should be separated from the analysis/follow-on,
>> for visibility, responsibility, and security concerns. While I understand a
>> separate UpdateAttribute processor could be put in the downstream flow to
>> remove these attributes, I would push for not adding them by default as a
>> more secure approach. Perhaps this could be configurable on the Get*
>> processor via a boolean property, but I think doing it automatically by
>> default introduces some serious concerns.
>> 
>> 
>> Andy LoPresto
>> [email protected]
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>> On May 13, 2018, at 11:48 AM, Mike Thomsen <[email protected]> wrote:
>> 
>> @Joe @Matt
>> 
>> This is kinda related to the point that Joe made in the graph DB thread
>> about provenance. My thought here was that we need some standards on
>> enriching the metadata about what was fetched so that no matter how you
>> store the provenance, you can find some way to query it for questions like
>> when a data set was loaded into NiFi, how many records went through a
>> terminating processor, etc. IMO this could help batch-oriented
>> organizations feel more at ease with something stream-oriented like NiFi.
>> 
>> On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen <[email protected]>
>> wrote:
>> 
>> I'd like to propose that all Get/Fetch/Query processors that are neither
>> deprecated nor likely to be deprecated get a standard convention for
>> attributes that describe things like:
>> 
>> 1. Source system.
>> 2. Database/table/index/collection/etc.
>> 3. The lookup criteria that were used (similar to the "query attribute"
>> some already have).
>> 
>> Using GetMongo as an example, it would add something like this:
>> 
>> source.url=mongodb://localhost:27017
>> source.database=testdb
>> source.collection=test_collection
>> source.query={ "username": "john.smith" }
>> source.criteria.username=john.smith // GetMongo would parse the query and add this.
>> 
>> We have a use case where a team is coming from an extremely batch-oriented
>> view and really wants to know when "dataset X" was run. Our solution was to
>> extract that from the result set because the dataset name is one of the
>> fields in the JSON body.
>> 
>> I think this would help expand what you can do out of the box with
>> provenance tracking because it would provide a lot of useful information
>> that could be stored in Solr or ES and then queried against terminating
>> processors' DROP events to get a solid window into when jobs were run
>> historically.
>> 
>> Thoughts?
>> 
>> 
>>
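
For reference, the `source.*` convention quoted above for GetMongo could be sketched roughly as follows. This is plain Python, not processor code; the `source_attributes` helper is hypothetical, and the rule that only simple top-level equality criteria get flattened into `source.criteria.*` is an assumption, since the proposal does not say how operators or nested documents should be handled.

```python
import json


def source_attributes(url, database, collection, query):
    """Build the proposed source.* attributes for one fetch."""
    attrs = {
        "source.url": url,
        "source.database": database,
        "source.collection": collection,
        "source.query": json.dumps(query),
    }
    # Assumption: flatten only top-level equality criteria with scalar
    # values; operator keys ($or, ...) and nested documents are skipped.
    for field, value in query.items():
        if field.startswith("$"):
            continue
        if isinstance(value, (str, int, float)):
            attrs["source.criteria." + field] = str(value)
    return attrs


attrs = source_attributes(
    "mongodb://localhost:27017",
    "testdb",
    "test_collection",
    {"username": "john.smith"},
)
```

With the query from the example, `attrs` would contain `source.criteria.username` set to `john.smith` alongside the four fixed keys.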
