Our data provenance is.  Just not our repository :)

On Thu, Jan 30, 2020 at 5:00 PM [email protected] <[email protected]>
wrote:

> Lars
>
> You're absolutely right about what you say.
> If the data in the NiFi repositories is only stored temporarily for a
> few hours, then documentation is quite sufficient.
>
> The original question was how to delete data from the data lineage.
> I had assumed that the NiFi repository could be used as a full Data Lineage System.
> If NiFi is your central application, then you could avoid having to
> install Atlas as well. And with Atlas, you would have to install Ranger,
> Cassandra or even Hadoop and HBase.
>
> Joe has already made it clear to me here that Data Provenance/Data
> Lineage of NiFi is not designed for this yet.
> Maybe in the future...
>
> Best
> Uwe
>
> On 30.01.2020 at 22:08, Lars Winderling wrote:
> > Dear Uwe and fellow devs,
> >
> > Sorry if I completely miss the point here, but I'll try. I am also working
> > with NiFi under GDPR regulations, in the online ad business. From my point
> > of view it would be sufficient to ensure that no new data gets stored once
> > a user requests deletion, and to delete all personal data from all
> > respective systems. The NiFi repos will expire their data, which can be
> > argued to equal a delayed deletion. Remember that GDPR is quite strict, but
> > if you have a proper case for this kind of process, e.g. due to technical
> > limitations, it needs to be documented, and then it will likely be OK. We
> > do it similarly, and our legal counsel approved this. My response, however,
> > is not legally binding. The regulation says something like "you should take
> > appropriate measures". If a tool like NiFi just doesn't let you delete
> > temporarily stored data instantly, this may seem acceptable.
> >
> > Best,
> > Lars
> >
> > On 30 January 2020 at 21:36:31 CET, Mike Thomsen <[email protected]> wrote:
> >> I suppose the elephant in the room here is what sort of personal data is
> >> being stored in your provenance records? Can't you just refactor your flows
> >> to ensure that the provenance data doesn't meaningfully contain anything
> >> traceable to a person?
> >>
> >> On Thu, Jan 30, 2020 at 12:41 PM [email protected] <[email protected]> wrote:
> >>
> >>> Emanuel
> >>>
> >>> That was not meant disrespectfully by me. And if that's how you felt,
> >>> then I apologize.
> >>>
> >>>> In what sense does NiFi relate to GDPR compliance?
> >>> All person-related data that flows, is read, sent or stored etc. in a
> >>> company is GDPR relevant.
> >>>
> >>>> - in terms of data FF contents - they are too transient (gone in 12 hours /
> >>>> default).
> >>> It makes no difference how long the data is stored. And it makes no
> >>> difference if data is stored on disk or just in memory.
> >>>
> >>> The data can potentially be read, processed by others or sent to other
> >>> systems and so on. Or the data can be used during this time to establish
> >>> relationships to other data (pseudo-anonymized data etc.).
> >>>
> >>>> I guess the discussion is about the fact that FF attributes are kept in
> >>>> the data provenance repo? (gone in 24h / default)
> >>> I'm afraid not. It's generally a matter of NiFi storing data - as
> >>> already mentioned, it doesn't make any difference whether it's on the
> >>> hard disk or just in memory.
> >>>
> >>>> I wonder where the culprit is here?
> >>> There's no culprit here. It's generally a problem with GDPR when
> >>> processing person-related data.
> >>> What counts as person-related data would fill a book, because machine
> >>> data can also be person-related, for example if I can relate a person
> >>> directly to the machine and place/time.
> >>> This would allow me to track a person/employee, and this is not allowed
> >>> (unless a law allows me to do so).
> >>>
> >>> All this goes much further and would be far too much to mention now.
> >>> In principle, we have a GDPR issue and must act in accordance with the law.
> >>> We do not agree with all the regulation either. But all regulations I
> >>> know so far have at least one justification. Even if we as enterprise
> >>> architects, developers, administrators etc. have our problems with them.
> >>>
> >>> Regards
> >>> Uwe
> >>>
> >>> On 30.01.2020 at 17:51, Emanuel Oliveira wrote:
> >>>> But enlighten me please :) isn't GDPR just about cleaning data from
> >>>> persistent storage?
> >>>> In what sense does NiFi relate to GDPR compliance?
> >>>>
> >>>>    - in terms of data FF contents - they are too transient (gone in
> >>>>    12 hours / default).
> >>>>    - I guess the discussion is about the fact that FF attributes are kept
> >>>>    in the data provenance repo? (gone in 24h / default)
> >>>>
> >>>> I wonder where the culprit is here? Is it the situation where one wants
> >>>> to keep a long trace of data provenance, like 6 months, but because
> >>>> attributes are stored on provenance events, they must be deleted?
> >>>> I guess it can only be a problem of deleting attributes from the
> >>>> provenance repo, and not FF contents, as those are gone fast enough?
> >>>>
> >>>> Best Regards,
> >>>> *Emanuel Oliveira*
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen <[email protected]> wrote:
> >>>>>> It was created on this side of the Atlantic because when people do care
> >>>>>> about such things - they REALLY care.
> >>>>>
> >>>>> Agreed. I was just commenting on our particular experiences with customers
> >>>>> in the federal space. There are unfortunately many who still don't get all
> >>>>> of the accountability and traceability advantages that provenance and
> >>>>> lineage tracking provide.
> >>>>>
> >>>>> On Thu, Jan 30, 2020 at 10:32 AM Joe Witt <[email protected]> wrote:
> >>>>>> Mike,
> >>>>>>
> >>>>>> It was created on this side of the Atlantic because when people do care
> >>>>>> about such things - they REALLY care.
> >>>>>>
> >>>>>> I anticipate more and more people will care and I hope that day comes
> >>>>>> soon.  I'm proud of NiFi's ability to be a leader here because if your flow
> >>>>>> management solution between sensors and processing and storage systems
> >>>>>> tells you where things came from and went to it is a heck of a good start.
> >>>>>> What exists in our provenance data is information about the data but this
> >>>>>> can be 'any attribute' put on a flow file throughout its life in the flow.
> >>>>>> We simply cannot guarantee this won't be 'content'.  The notion of what is
> >>>>>> metadata vs content gets blurry fast.
> >>>>>>
> >>>>>> Uwe,
> >>>>>>
> >>>>>> The data provenance capabilities within NiFi do not support the ability to
> >>>>>> 'delete records' based on specified parameters.  The only mechanism is
> >>>>>> space or time based age off.  For now, whatever the obligation is to
> >>>>>> respond to a right to be forgotten request should be what the provenance
> >>>>>> within NiFi is configured to hold.  If for instance you have 24 hours then
> >>>>>> provenance in NiFi should hold no more than 24 hours.
> >>>>>>
> >>>>>> I doubt this is something we'll be able to spend time on soon but I agree
> >>>>>> the idea of being able to purge out records based on more precise
> >>>>>> parameters is a good one.
> >>>>>>
> >>>>>> The intent is not that the built-in NiFi provenance store is for long term
> >>>>>> but rather the records are there long enough to support flow management use
> >>>>>> cases but are always being exported to a long term store such as Atlas or
> >>>>>> even just stored in HDFS or other locations for additional use.  One
> >>>>>> day...a sweet graph database...
> >>>>>>
> >>>>>> Thanks
> >>>>>> Joe
> >>>>>>
> >>>>>> On Thu, Jan 30, 2020 at 10:29 AM Emanuel Oliveira <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Some recap on NiFi concepts:
> >>>>>>>
> >>>>>>>    - Content Repository stores FF contents.
> >>>>>>>    - Data Provenance events - used to check the lineage/history of FFs -
> >>>>>>>    only store pointers to FFs (not contents).
> >>>>>>>    - so one can have data deleted and still access the lineage/data
> >>>>>>>    provenance history.
> >>>>>>>
> >>>>>>> Here's a lot of in-depth material on the subject, but the 3 points above
> >>>>>>> are the summary of it all:
> >>>>>>> https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
> >>>>>>>
> >>>>>>>
> >>>>>>> *DATA - persistent data only exists in 2 scenarios:*
> >>>>>>>
> >>>>>>>    - while your flow file is running.
> >>>>>>>    - archived in the content repository for 12h (to allow access to the
> >>>>>>>    contents when inspecting data provenance/lineage).
> >>>>>>>
> >>>>>>> https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418
> >>>>>>>
> >>>>>>> *PROVENANCE EVENTS (LINEAGE) OF DATA:*
> >>>>>>>
> >>>>>>>    - contain only provenance attributes, FF uuid etc. but NO CONTENTS,
> >>>>>>>    available for 24h unless changed in the config files.
> >>>>>>>    - https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> >>>>>>>
> >>>>>>> So, as you see, both contexts expire daily by default, fast enough that
> >>>>>>> I don't think GDPR is any problem or that any action is needed.
> >>>>>>> Now one can always boost retention of just the data provenance events to
> >>>>>>> months, 1 year or whatever suits. But the data is long gone anyway.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> *Emanuel Oliveira*
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jan 30, 2020 at 2:26 PM [email protected] <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> GDPR doesn't need millisecond real-time deletion, right?)
> >>>>>>>> right.
> >>>>>>>>
> >>>>>>>>> since inbound FFs have
> >>>>>>>>>    normally hundreds, thousands of records that will need to split, aggregate,
> >>>>>>>>>    in complex flow file, implementing a clean
> >>>>>>>> It depends on your application. Not everyone uses NiFi for IoT and
> >>>>>>>> therefore a single record may be included.
> >>>>>>>>
> >>>>>>>>> In my opinion your answer to business/management gate keepers is that data
> >>>>>>>>> will be stored on data provenance for 24h (default) which can be
> >>>>>>>>> configured, and that
> >>>>>>>> This is not necessarily the point of the Data Lineage, that the
> >>>>>>>> information is deleted after 24 hours (or whatever is configured).
> >>>>>>>> If Data Lineage is needed (audits, legal requirements etc.), then
> >>>>>>>> deleting the data after a defined time is not an option.
> >>>>>>>>
> >>>>>>>> This is the reason why Atlas supports it.
> >>>>>>>>
> >>>>>>>> Best Regards,
> >>>>>>>> Uwe
> >>>>>>>>
> >>>>>>>> On 30.01.2020 at 15:06, Emanuel Oliveira wrote:
> >>>>>>>>> Hi, I don't think an API for atomic records makes sense:
> >>>>>>>>>
> >>>>>>>>>    1. one configures the retention of data provenance (the default 24h is
> >>>>>>>>>    "good enough"; GDPR doesn't need millisecond real-time deletion, right?)
> >>>>>>>>>    https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> >>>>>>>>>    2. even if there were an API to delete FFs with an attribute =
> >>>>>>>>>    <some id>, that would normally be useless as well, since inbound FFs
> >>>>>>>>>    normally have hundreds or thousands of records that will need to be
> >>>>>>>>>    split and aggregated in a complex flow. Implementing a clean-up at a
> >>>>>>>>>    nano-atomic level would be too hard, and the extra effort is not needed,
> >>>>>>>>>    since your target single record would surely be part of multiple FF
> >>>>>>>>>    UUIDs, some holding only your record, but most surely holding 100s or
> >>>>>>>>>    1000s of other records with your record somewhere in the middle.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> In my opinion your answer to business/management gate keepers is that data
> >>>>>>>>> will be stored on data provenance for 24h (default) which can be
> >>>>>>>>> configured, and that
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Best Regards,
> >>>>>>>>> *Emanuel Oliveira*
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Jan 30, 2020 at 1:54 PM [email protected] <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Dear NiFi developer team,
> >>>>>>>>>>
> >>>>>>>>>> NiFi's Data Provenance and Data Lineage are perfectly adequate in the
> >>>>>>>>>> environment of NiFi, so there is often no need to use Atlas.
> >>>>>>>>>>
> >>>>>>>>>> When using NiFi with customer data a problem arises.
> >>>>>>>>>> The problem is the GDPR requirement that a user has the right to be
> >>>>>>>>>> forgotten. Unfortunately, I can't find any API call or information on
> >>>>>>>>>> how to delete individual user data from the NiFi Provenance Repository
> >>>>>>>>>> based on a user-defined attribute and its defined characteristics.
> >>>>>>>>>> A delete request like "delete all data and dependencies where the
> >>>>>>>>>> attribute XYZ has the value 123" is currently not possible to my knowledge.
> >>>>>>>>>>
> >>>>>>>>>> My questions are:
> >>>>>>>>>> Is this actually possible and how? And if not, is it planned?
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Uwe
> >>>>>>>>>>
> >>>
>
>
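
A minimal sketch of the advice given earlier in the thread (keep NiFi's retention no longer than the deadline you have for honouring a deletion request): the Python below reads the text of a nifi.properties file and reports time-based retention settings that exceed a given deadline. The two property names are the ones documented in the NiFi Administration Guide; the helper names, the sample values and the deadline_hours parameter are assumptions made purely for illustration, and only hour and day units are handled. It covers time-based age-off only and says nothing about what deadline the law actually requires.

```python
import re

# Time-based retention keys from nifi.properties (see the NiFi Administration Guide).
RETENTION_KEYS = (
    "nifi.provenance.repository.max.storage.time",
    "nifi.content.repository.archive.max.retention.period",
)

# Illustrative unit table: only hours and days are supported in this sketch.
_UNITS_IN_HOURS = {"hour": 1, "hours": 1, "hr": 1, "hrs": 1, "day": 24, "days": 24}


def to_hours(value):
    """Convert a duration such as '24 hours' or '30 days' to a number of hours."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", value)
    if not match or match.group(2).lower() not in _UNITS_IN_HOURS:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS_IN_HOURS[match.group(2).lower()]


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def retention_violations(text, deadline_hours):
    """Return the retention settings whose duration exceeds deadline_hours."""
    props = parse_properties(text)
    return {
        key: props[key]
        for key in RETENTION_KEYS
        if key in props and to_hours(props[key]) > deadline_hours
    }


# Hypothetical sample configuration, not taken from the thread.
SAMPLE = """
# retention settings
nifi.provenance.repository.max.storage.time=30 days
nifi.content.repository.archive.max.retention.period=12 hours
"""

if __name__ == "__main__":
    print(retention_violations(SAMPLE, deadline_hours=24))
```

With the sample above and a 24-hour deadline, only the provenance setting is reported, since 30 days is longer than the deadline while the 12-hour content archive is not. Size-based age-off would need a separate check.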
