Lars

You're absolutely right.
If the data in the NiFi repositories is only stored temporarily for a
few hours, then documenting that process is quite sufficient.

The original question was how to delete data from the data lineage.
I had assumed that the NiFi repository could be used as a full data
lineage system.
If NiFi is your central application, then you could avoid having to
install Atlas as well. And with Atlas, you would have to install Ranger,
Cassandra or even Hadoop and HBase.

Joe has already made it clear to me here that Data Provenance/Data
Lineage of NiFi is not designed for this yet.
Maybe in the future...
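
Until then, Joe's advice further down this thread (let provenance hold no
longer than the deadline you have for a right-to-be-forgotten request) can at
least be checked mechanically. What follows is only a minimal sketch and not
part of NiFi: the two property keys are the time-based retention settings from
nifi.properties, while the sample values, the helper names and the supported
time units (a subset of what NiFi accepts) are my own assumptions. It deletes
nothing and ignores the size-based age-off; it only flags settings that keep
data longer than a given deadline.

```python
import re
from datetime import timedelta

# Time-based retention keys from nifi.properties.
RETENTION_KEYS = (
    "nifi.provenance.repository.max.storage.time",
    "nifi.content.repository.archive.max.retention.period",
)

# Assumption: only a subset of NiFi's time units is handled here.
_UNIT_SECONDS = {
    "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
    "day": 86400, "days": 86400,
}


def parse_period(text):
    """Turn a value such as '24 hours' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", text)
    if not match or match.group(2).lower() not in _UNIT_SECONDS:
        raise ValueError(f"unsupported time period: {text!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2).lower()]
    return timedelta(seconds=seconds)


def parse_properties(text):
    """Read key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def retention_violations(properties_text, deadline):
    """Return {key: period} for retention settings longer than the deadline."""
    props = parse_properties(properties_text)
    violations = {}
    for key in RETENTION_KEYS:
        if key in props:
            period = parse_period(props[key])
            if period > deadline:
                violations[key] = period
    return violations


# Hypothetical excerpt of a nifi.properties file, for illustration only.
SAMPLE = """
# retention settings
nifi.provenance.repository.max.storage.time=30 days
nifi.content.repository.archive.max.retention.period=12 hours
"""

if __name__ == "__main__":
    for key, period in retention_violations(SAMPLE, timedelta(hours=24)).items():
        print(f"{key} keeps data for {period}, longer than the deadline")
```

With a 24 hour deadline, the sample above would flag only the provenance
setting, because the content archive is already below the deadline.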

Best
Uwe

On 30.01.2020 at 22:08, Lars Winderling wrote:
> Dear Uwe and fellow devs,
>
> sorry if I completely miss the point here, but I'll try. I also work with
> NiFi under GDPR regulations, in the online ad business. From my point of
> view it would be sufficient, when a user requests deletion, to ensure that
> no new data gets stored and to delete all personal data from all respective
> systems. The NiFi repos will expire their data, which can be argued to equal
> a delayed deletion. Remember that GDPR is quite strict, but if you have a
> proper case for this kind of process, e.g. due to technical limitations, it
> needs to be documented, and then it will likely be ok. We do it similarly,
> and our legal counsel approved this. My response, however, is not legally
> binding. The regulation says something like you should take appropriate
> measures. If a tool like NiFi just doesn't let you delete temporarily
> stored data instantly, this may seem acceptable.
>
> Best,
> Lars
>
> On 30 January 2020 at 21:36:31 CET, Mike Thomsen <[email protected]> wrote:
>> I suppose the elephant in the room here is what sort of personal data is
>> being stored in your provenance records? Can't you just refactor your flows
>> to ensure that the provenance data doesn't contain anything meaningfully
>> traceable to a person?
>>
>> On Thu, Jan 30, 2020 at 12:41 PM [email protected] <[email protected]> wrote:
>>
>>> Emanuel
>>>
>>> That was not meant disrespectfully by me. And if that's how you felt,
>>> then I apologize.
>>>
>>>> In what sense does NiFi relates to GDPR compliance ?
>>> All personal data that flows through, is read, sent, or stored etc. in a
>>> company is GDPR-relevant.
>>>
>>>> - in terms of data FF contents - they too transient (gone in 12hours /
>>>> default).
>>> It makes no difference how long the data is stored. And it makes no
>>> difference if data is stored on disk or just in memory.
>>>
>>> The data can potentially be read, processed by others, or sent to other
>>> systems and so on. Or the data can be used during this time to establish
>>> relationships to other data (pseudonymized data etc.).
>>>
>>>> I guess discussion is on the fact FF attributes are kept on the data
>>>> provenance repo ? (gone in 24h / default)
>>> I'm afraid not. It's generally a matter of NiFi storing data - as
>>> already mentioned, it doesn't make any difference whether it's on the
>>> hard disk or just in memory.
>>>
>>>> I wonder where the culprit here ?
>>> There's no culprit here. It's generally a problem with GDPR when
>>> processing personal data.
>>> What counts as personal data could fill a book, because machine data
>>> can also be personal, for example if I can relate a person directly to
>>> the machine and place/time. This would allow me to track a
>>> person/employee, and this is not allowed (unless a law allows me to do so).
>>>
>>> All this goes much further and would be far too much to mention now.
>>> In principle, we have a GDPR issue and must act in accordance with the law.
>>> We do not agree with all of the regulation either. But all regulations I
>>> know so far have at least one justification, even if we as enterprise
>>> architects, developers, administrators etc. have our problems with them.
>>>
>>> Regards
>>> Uwe
>>>
>>> On 30.01.2020 at 17:51, Emanuel Oliveira wrote:
>>>> But enlighten me please :) isn't GDPR just about deleting data from
>>>> persistent storage?
>>>> In what sense does NiFi relate to GDPR compliance?
>>>>
>>>>    - in terms of FF contents - they are too transient (gone in 12 hours by
>>>>    default).
>>>>    - I guess the discussion is about the fact that FF attributes are kept
>>>>    in the data provenance repo? (gone in 24h by default)
>>>>
>>>> I wonder where the culprit is here? Is it the situation where one wants
>>>> to keep a long trace of data provenance, like 6 months, but because
>>>> attributes are stored on provenance events, they must be deleted?
>>>> I guess it can only be a problem of deleting attributes from the provenance
>>>> repo and not FF contents, right, as they are gone fast enough?
>>>>
>>>> Best Regards,
>>>> *Emanuel Oliveira*
>>>>
>>>>
>>>>
>>>> On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen <[email protected]> wrote:
>>>>>> It was created on this side of the Atlantic because when people do care
>>>>>> about such things - they REALLY care.
>>>>>
>>>>> Agreed. I was just commenting on our particular experiences with customers
>>>>> in the federal space. There are unfortunately many who still don't get all
>>>>> of the accountability and traceability advantages that provenance and
>>>>> lineage tracking provide.
>>>>>
>>>>> On Thu, Jan 30, 2020 at 10:32 AM Joe Witt <[email protected]> wrote:
>>>>>> Mike,
>>>>>>
>>>>>> It was created on this side of the Atlantic because when people do care
>>>>>> about such things - they REALLY care.
>>>>>>
>>>>>> I anticipate more and more people will care and I hope that day comes
>>>>>> soon.  I'm proud of NiFi's ability to be a leader here because if your flow
>>>>>> management solution between sensors and processing and storage systems
>>>>>> tells you where things came from and went to it is a heck of a good start.
>>>>>> What exists in our provenance data is information about the data but this
>>>>>> can be 'any attribute' put on a flow file throughout its life in the flow.
>>>>>> We simply cannot guarantee this won't be 'content'.  The notion of what is
>>>>>> metadata vs content gets blurry fast.
>>>>>>
>>>>>> Uwe,
>>>>>>
>>>>>> The data provenance capabilities within NiFi do not support the ability to
>>>>>> 'delete records' based on specified parameters.  The only mechanism is
>>>>>> space or time based age off.  For now, whatever the obligation is to
>>>>>> respond to a right to be forgotten request should be what the provenance
>>>>>> within NiFi is configured to hold.  If for instance you have 24 hours then
>>>>>> provenance in NiFi should hold no more than 24 hours.
>>>>>>
>>>>>> I doubt this is something we'll be able to spend time on soon, but I agree
>>>>>> that being able to purge records based on more precise parameters is a
>>>>>> good idea.
>>>>>>
>>>>>> The intent is not that the built-in NiFi provenance store is for long term
>>>>>> use, but rather that the records are there long enough to support flow
>>>>>> management use cases and are always being exported to a long term store
>>>>>> such as Atlas, or even just stored in HDFS or other locations for
>>>>>> additional use.  One day...a sweet graph database...
>>>>>>
>>>>>> Thanks
>>>>>> Joe
>>>>>>
>>>>>> On Thu, Jan 30, 2020 at 10:29 AM Emanuel Oliveira <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Some recap on NiFi concepts:
>>>>>>>
>>>>>>>    - The Content Repository stores FF contents.
>>>>>>>    - Data Provenance events (used to check the lineage and history of
>>>>>>>    FFs) only store pointers to FFs (not contents).
>>>>>>>    - So one can have data deleted and still access the lineage/data
>>>>>>>    provenance history.
>>>>>>>
>>>>>>> Here's a lot of in-depth material on the subject, but the 3 points above
>>>>>>> are the summary of it all:
>>>>>>> https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
>>>>>>>
>>>>>>>
>>>>>>> *DATA - persistent data only exists in 2 scenarios:*
>>>>>>>
>>>>>>>    - while your flow file is running.
>>>>>>>    - archived in the content repository for 12h (to allow access to the
>>>>>>>    contents when inspecting data provenance/lineage).
>>>>>>>
>>>>>>>
>>>>>>> https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418
>>>>>>>
>>>>>>> *PROVENANCE EVENTS (LINEAGE) OF DATA:*
>>>>>>>
>>>>>>>    - contains only provenance attributes, FF UUID etc., but NO CONTENTS;
>>>>>>>    available for 24h unless changed in the config files.
>>>>>>>    - https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
>>>>>>>
>>>>>>> So as you can see, both expire within a day by default, fast enough that I
>>>>>>> don't think GDPR is a problem or that any action is needed.
>>>>>>> Now one can always boost the retention of just the data provenance events
>>>>>>> to months, 1 year or whatever suits. But the data is long gone anyway.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> *Emanuel Oliveira*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 30, 2020 at 2:26 PM [email protected] <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> GDPR doesnt need milisecond realtime deletion right ?)
>>>>>>>> right.
>>>>>>>>
>>>>>>>>> since inbound FFs have
>>>>>>>>>    normally hundreds, thousands of records that will need to split,
>>>>>>>>>    aggregate,
>>>>>>>>>    in complex flow file, implementing a clean
>>>>>>>> It depends on your application. Not everyone uses NiFi for IoT, and
>>>>>>>> therefore a single record may be included.
>>>>>>>>
>>>>>>>>> In my opinion your answer to business/management gate keepers is that
>>>>>>>>> data will be stored on data provenance for 24h (default) which can be
>>>>>>>>> configured, and that
>>>>>>>> This is not necessarily the point of data lineage, that the
>>>>>>>> information is deleted after 24 hours (or whatever is configured).
>>>>>>>> If data lineage is needed (auditing, legal requirements etc.), then
>>>>>>>> deleting the data after a defined time is not an option.
>>>>>>>>
>>>>>>>> This is the reason why Atlas supports it.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Uwe
>>>>>>>>
>>>>>>>> On 30.01.2020 at 15:06, Emanuel Oliveira wrote:
>>>>>>>>> Hi, I don't think an API for atomic records makes sense:
>>>>>>>>>
>>>>>>>>>    1. One configures the retention of data provenance (the default of 24h
>>>>>>>>>    is "good enough"; GDPR doesn't need millisecond realtime deletion,
>>>>>>>>>    right?)
>>>>>>>>>    https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
>>>>>>>>>    2. Even if there were an API to delete FFs with an attribute =
>>>>>>>>>    <some id>, that would normally be useless as well, since inbound FFs
>>>>>>>>>    normally have hundreds or thousands of records that need to be split
>>>>>>>>>    and aggregated in a complex flow. Implementing a clean-up at a
>>>>>>>>>    nano-atomic level would be too hard, and the extra effort is not needed,
>>>>>>>>>    since your single target record would surely be part of multiple FF
>>>>>>>>>    UUIDs: some holding only your record, but most surely holding 100s or
>>>>>>>>>    1000s of other records with your record somewhere in the middle.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In my opinion your answer to business/management gate keepers is that
>>>>>>>>> data will be stored on data provenance for 24h (default) which can be
>>>>>>>>> configured, and that
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> *Emanuel Oliveira*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jan 30, 2020 at 1:54 PM [email protected] <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dear NiFi developer team,
>>>>>>>>>>
>>>>>>>>>> NiFi's Data Provenance and Data Lineage is perfectly adequate in the
>>>>>>>>>> environment of NiFi, so there is often no need to use Atlas.
>>>>>>>>>>
>>>>>>>>>> When using NiFi with customer data a problem arises.
>>>>>>>>>> The problem is the GDPR requirement that a user has the right to be
>>>>>>>>>> forgotten. Unfortunately, I can't find any API call or information on
>>>>>>>>>> how to delete individual user data from the NiFi Provenance Repository
>>>>>>>>>> based on a user-defined attribute and its defined characteristics.
>>>>>>>>>> A delete request like "delete all data and dependencies where the
>>>>>>>>>> attribute XYZ has the value 123" is currently not possible to my
>>>>>>>>>> knowledge.
>>>>>>>>>>
>>>>>>>>>> My questions are:
>>>>>>>>>> Is this actually possible and how? And if not, is it planned?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Uwe
>>>>>>>>>>
>>>
