Our data provenance is. Just not our repository :)

On Thu, Jan 30, 2020 at 5:00 PM [email protected] <[email protected]> wrote:
> Lars
>
> You're absolutely right about what you say.
> If the data in the NiFi repositories is only stored temporarily for a
> few hours, then documentation is quite sufficient.
>
> The original question was how to delete data from the data lineage.
> I had assumed the NiFi repository could be used as a full Data Lineage
> system. If NiFi is your central application, then you could avoid having
> to install Atlas as well. And with Atlas, you would have to install
> Ranger, Cassandra or even Hadoop and HBase.
>
> Joe has already made it clear to me here that Data Provenance/Data
> Lineage of NiFi is not designed for this yet.
> Maybe in the future...
>
> Best
> Uwe
>
> On 30.01.2020 at 22:08, Lars Winderling wrote:
> > Dear Uwe and fellow devs,
> >
> > sorry if I completely miss the point here, but I'll try. I am also
> > working with NiFi under GDPR regulations, in the online ad business.
> > From my point of view it would be sufficient to ensure that no new
> > data gets stored once a user requests deletion, and to delete all
> > personal data from all respective systems. The NiFi repos will expire
> > their data, which can be argued to equal a delayed deletion. Remember
> > that GDPR is quite strict, but if you have a proper case for this kind
> > of process, e.g. due to technical limitations, it needs to be
> > documented, and then it will likely be ok. We do it similarly, and our
> > legal counsel approved this. My response, however, is not legally
> > binding. The regulation says something like you should take
> > appropriate measures. If a tool like NiFi just doesn't let you delete
> > temporarily stored data instantly, this may seem acceptable.
> >
> > Best,
> > Lars
> >
> > On 30 January 2020 21:36:31 CET, Mike Thomsen <[email protected]> wrote:
> >> I suppose the elephant in the room here is what sort of personal
> >> data is being stored in your provenance records? Can't you just
> >> refactor your flows to ensure that the provenance data doesn't
> >> contain anything meaningfully traceable to a person?
> >>
> >> On Thu, Jan 30, 2020 at 12:41 PM [email protected] <[email protected]> wrote:
> >>
> >>> Emanuel
> >>>
> >>> That was not meant disrespectfully by me. And if that's how you
> >>> felt, then I apologize.
> >>>
> >>>> In what sense does NiFi relate to GDPR compliance?
> >>> All person-related data that flows, is read, sent or stored etc. in
> >>> a company is GDPR relevant.
> >>>
> >>>> - in terms of data, FF contents are too transient (gone in
> >>>> 12 hours by default).
> >>> It makes no difference how long the data is stored. And it makes no
> >>> difference if data is stored on disk or just in memory.
> >>>
> >>> The data can potentially be read, processed by others or sent to
> >>> other systems and so on. Or the data can be used during this time to
> >>> establish relationships to other data (pseudo-anonymized data etc.).
> >>>
> >>>> I guess the discussion is about the fact that FF attributes are
> >>>> kept in the data provenance repo? (gone in 24h by default)
> >>> I'm afraid not. It's generally a matter of NiFi storing data - as
> >>> already mentioned, it doesn't make any difference whether it's on
> >>> the hard disk or just in memory.
> >>>
> >>>> I wonder where the culprit is here?
> >>> There's no culprit here. It's generally a problem with GDPR when
> >>> processing person-related data.
> >>> What counts as person-related would fill a book, because machine
> >>> data can also be person-related, for example if I can relate a
> >>> person directly to the machine and place/time.
> >>> This would allow me to track a person/employee and this is not
> >>> allowed (unless a law allows me to do so).
> >>>
> >>> All this goes much further and would be far too much to mention now.
> >>> In principle, we have a GDPR issue and must act in accordance with
> >>> the law.
> >>> We do not agree with all the regulation either. But all regulations
> >>> I know so far have at least one justification. Even if we as
> >>> enterprise architects, developers, administrators etc. have our
> >>> problems with them.
> >>>
> >>> Regards
> >>> Uwe
> >>>
> >>> On 30.01.2020 at 17:51, Emanuel Oliveira wrote:
> >>>> But enlighten me please :) isn't GDPR just about cleaning data
> >>>> from persistent storage?
> >>>> In what sense does NiFi relate to GDPR compliance?
> >>>>
> >>>> - in terms of data, FF contents are too transient (gone in
> >>>> 12 hours by default).
> >>>> - I guess the discussion is about the fact that FF attributes are
> >>>> kept in the data provenance repo? (gone in 24h by default)
> >>>>
> >>>> I wonder where the culprit is here? Is it the situation where one
> >>>> wants to keep a long trace of data provenance, like 6 months, but
> >>>> because attributes are stored on provenance events, they must be
> >>>> deleted?
> >>>> I guess it can only be a problem of deleting attributes from the
> >>>> provenance repo and not FF contents, right, as they are gone fast
> >>>> enough?
> >>>>
> >>>> Best Regards,
> >>>> *Emanuel Oliveira*
> >>>>
> >>>> On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen <[email protected]> wrote:
> >>>>>> It was created on this side of the Atlantic because when people
> >>>>>> do care about such things - they REALLY care.
> >>>>>
> >>>>> Agreed. I was just commenting on our particular experiences with
> >>>>> customers in the federal space. There are unfortunately many who
> >>>>> still don't get all of the accountability and traceability
> >>>>> advantages that provenance and lineage tracking provides.
> >>>>>
> >>>>> On Thu, Jan 30, 2020 at 10:32 AM Joe Witt <[email protected]> wrote:
> >>>>>> Mike,
> >>>>>>
> >>>>>> It was created on this side of the Atlantic because when people
> >>>>>> do care about such things - they REALLY care.
> >>>>>>
> >>>>>> I anticipate more and more people will care and I hope that day
> >>>>>> comes soon. I'm proud of NiFi's ability to be a leader here
> >>>>>> because if your flow management solution between sensors and
> >>>>>> processing and storage systems tells you where things came from
> >>>>>> and went to, it is a heck of a good start.
> >>>>>> What exists in our provenance data is information about the data,
> >>>>>> but this can be 'any attribute' put on a flow file throughout its
> >>>>>> life in the flow.
> >>>>>> We simply cannot guarantee this won't be 'content'. The notion of
> >>>>>> what is metadata vs content gets blurry fast.
> >>>>>>
> >>>>>> Uwe,
> >>>>>>
> >>>>>> The data provenance capabilities within NiFi do not support the
> >>>>>> ability to 'delete records' based on specified parameters. The
> >>>>>> only mechanism is space- or time-based age-off. For now, whatever
> >>>>>> the obligation is to respond to a right to be forgotten request
> >>>>>> should be what the provenance within NiFi is configured to hold.
> >>>>>> If for instance you have 24 hours, then provenance in NiFi should
> >>>>>> hold no more than 24 hours.
> >>>>>>
> >>>>>> I doubt this is something we'll be able to spend time on soon,
> >>>>>> but I agree that being able to purge records based on more
> >>>>>> precise parameters is a good idea.
> >>>>>>
> >>>>>> The intent is not that the built-in NiFi provenance store is for
> >>>>>> the long term. Rather, the records are there long enough to
> >>>>>> support flow management use cases but are always being exported
> >>>>>> to a long-term store such as Atlas, or even just stored in HDFS
> >>>>>> or other locations for additional use. One day...a sweet graph
> >>>>>> database...
> >>>>>>
> >>>>>> Thanks
> >>>>>> Joe
> >>>>>>
> >>>>>> On Thu, Jan 30, 2020 at 10:29 AM Emanuel Oliveira <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Some recap on NiFi concepts:
> >>>>>>>
> >>>>>>> - The Content Repository stores FF contents.
> >>>>>>> - Data Provenance events (used to check the lineage and history
> >>>>>>> of FFs) only store pointers to FFs (not contents).
> >>>>>>> - So one can have data deleted and still access lineage/data
> >>>>>>> provenance history.
> >>>>>>>
> >>>>>>> Here's a lot of in-depth material on the subject, but the 3
> >>>>>>> points above are the summary of it all:
> >>>>>>> https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
> >>>>>>>
> >>>>>>> *DATA - persistent data only exists in 2 scenarios:*
> >>>>>>>
> >>>>>>> - while your flow file is running.
> >>>>>>> - archived in the content repository for 12h (to allow access to
> >>>>>>> contents when inspecting data provenance/lineage).
> >>>>>>>
> >>>>>>> https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418
> >>>>>>>
> >>>>>>> *PROVENANCE EVENTS (LINEAGE) OF DATA:*
> >>>>>>>
> >>>>>>> - contain only provenance attributes, FF uuid etc. but NO
> >>>>>>> CONTENTS, available for 24h unless increased/changed in the
> >>>>>>> config files.
> >>>>>>> - https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> >>>>>>>
> >>>>>>> So as you see, both contexts expire daily by default, fast
> >>>>>>> enough that I don't think GDPR is any problem or that any action
> >>>>>>> is needed.
> >>>>>>> Now one can always boost retention of just the data provenance
> >>>>>>> events to months, 1 year or whatever suits. But the data is long
> >>>>>>> gone anyway.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> *Emanuel Oliveira*
> >>>>>>>
> >>>>>>> On Thu, Jan 30, 2020 at 2:26 PM [email protected] <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> GDPR doesn't need millisecond realtime deletion, right?)
> >>>>>>>> right.
> >>>>>>>>
> >>>>>>>>> since inbound FFs normally have hundreds or thousands of
> >>>>>>>>> records that will need to be split and aggregated in a
> >>>>>>>>> complex flow, implementing a clean
> >>>>>>>> It depends on your application. Not everyone uses NiFi for IoT
> >>>>>>>> and therefore a single record may be included.
> >>>>>>>>
> >>>>>>>>> In my opinion your answer to business/management gate keepers
> >>>>>>>>> is that data will be stored on data provenance for 24h
> >>>>>>>>> (default) which can be configured, and that
> >>>>>>>> This is not necessarily the point of Data Lineage, that the
> >>>>>>>> information is deleted after 24 hours (or whatever is
> >>>>>>>> configured).
> >>>>>>>> If Data Lineage is needed (audits, legal requirements etc.),
> >>>>>>>> then deleting the data after a defined time is not an option.
> >>>>>>>>
> >>>>>>>> This is the reason why Atlas supports it.
> >>>>>>>>
> >>>>>>>> Best Regards,
> >>>>>>>> Uwe
> >>>>>>>>
> >>>>>>>> On 30.01.2020 at 15:06, Emanuel Oliveira wrote:
> >>>>>>>>> Hi, I don't think an API for atomic records makes sense:
> >>>>>>>>>
> >>>>>>>>> 1. One configures the retention of data provenance (the
> >>>>>>>>> default 24h is "good enough"; GDPR doesn't need millisecond
> >>>>>>>>> realtime deletion, right?)
> >>>>>>>>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> >>>>>>>>> 2. Even if there were an API to delete FFs with an
> >>>>>>>>> attribute = <some id>, that would normally be useless as
> >>>>>>>>> well, since inbound FFs normally have hundreds or thousands
> >>>>>>>>> of records that will need to be split and aggregated in a
> >>>>>>>>> complex flow. Implementing a clean-up at a nano-atomic level
> >>>>>>>>> would be too hard, and the extra effort is not needed, since
> >>>>>>>>> your target single record would surely be part of multiple
> >>>>>>>>> FF UUIDs, some holding only your record, but most will
> >>>>>>>>> surely hold hundreds of other records with your record
> >>>>>>>>> somewhere in the middle.
> >>>>>>>>>
> >>>>>>>>> In my opinion your answer to business/management gate keepers
> >>>>>>>>> is that data will be stored on data provenance for 24h
> >>>>>>>>> (default) which can be configured, and that
> >>>>>>>>>
> >>>>>>>>> Best Regards,
> >>>>>>>>> *Emanuel Oliveira*
> >>>>>>>>>
> >>>>>>>>> On Thu, Jan 30, 2020 at 1:54 PM [email protected] <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Dear NiFi developer team,
> >>>>>>>>>>
> >>>>>>>>>> NiFi's Data Provenance and Data Lineage is perfectly adequate
> >>>>>>>>>> in the environment of NiFi, so there is often no need to use
> >>>>>>>>>> Atlas.
> >>>>>>>>>>
> >>>>>>>>>> When using NiFi with customer data a problem arises.
> >>>>>>>>>> The problem is the GDPR requirement that a user has the right
> >>>>>>>>>> to be forgotten.
> >>>>>>>>>> Unfortunately, I can't find any API call or information on
> >>>>>>>>>> how to delete individual user data from the NiFi Provenance
> >>>>>>>>>> Repository based on a user-defined attribute and its defined
> >>>>>>>>>> characteristics.
> >>>>>>>>>> A delete request like "delete all data and dependencies where
> >>>>>>>>>> the attribute XYZ has the value 123" is currently not
> >>>>>>>>>> possible to my knowledge.
> >>>>>>>>>>
> >>>>>>>>>> My questions are:
> >>>>>>>>>> Is this actually possible and how? And if not, is it planned?
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Uwe
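
Editor's sketch: the practical advice in this thread is that, since provenance records cannot be deleted selectively, the configured retention should not exceed the window in which an erasure request must be honored. The Python below is a minimal illustration of checking that, not anything NiFi ships. It assumes the two retention settings are `nifi.provenance.repository.max.storage.time` and `nifi.content.repository.archive.max.retention.period` in `nifi.properties`, and that values look like `24 hours` or `30 days`. The function names and the unit table are hypothetical.

```python
import re

# Assumed property names (check them against your NiFi version's admin guide).
PROVENANCE_KEY = "nifi.provenance.repository.max.storage.time"
ARCHIVE_KEY = "nifi.content.repository.archive.max.retention.period"

# Hypothetical unit table: only the units this sketch understands, in minutes.
_UNIT_MINUTES = {
    "min": 1, "mins": 1, "minute": 1, "minutes": 1,
    "hour": 60, "hours": 60, "hr": 60, "hrs": 60,
    "day": 1440, "days": 1440,
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def to_minutes(period):
    """Convert a period such as '24 hours' or '30 days' to minutes."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", period.strip())
    if match is None:
        raise ValueError(f"unrecognized period: {period!r}")
    unit = match.group(2).lower()
    if unit not in _UNIT_MINUTES:
        raise ValueError(f"unsupported unit in period: {period!r}")
    return int(match.group(1)) * _UNIT_MINUTES[unit]


def retention_violations(properties_text, max_hours):
    """Return {property: configured value} for settings above max_hours."""
    props = parse_properties(properties_text)
    violations = {}
    for key in (PROVENANCE_KEY, ARCHIVE_KEY):
        if key in props and to_minutes(props[key]) > max_hours * 60:
            violations[key] = props[key]
    return violations


if __name__ == "__main__":
    sample = (
        "nifi.provenance.repository.max.storage.time=30 days\n"
        "nifi.content.repository.archive.max.retention.period=12 hours\n"
    )
    # With a 24-hour obligation, only the provenance setting is flagged.
    print(retention_violations(sample, max_hours=24))
```

This only inspects configuration and deletes nothing. Records exported to a long-term store such as Atlas or HDFS fall outside what it checks.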
