+Connie Chen <[email protected]>

Product Analytics has started to discuss backup practices on our team and
Connie Chen, our new data analyst, is documenting the outcomes of those
discussions.

Kate

--

Kate Zimmerman (she/they)
Head of Product Analytics
Wikimedia Foundation


On Mon, Jul 15, 2019 at 4:36 PM Leila Zia <[email protected]> wrote:

> All clear, Luca and Nuria. Thanks!
>
>
> On Thu, Jul 11, 2019 at 2:55 AM Luca Toscano <[email protected]>
> wrote:
> >
> > Hi Leila and Kate,
> >
> > adding a few words after Nuria's email to clarify my original intentions.
> > My point was that any important file that needs to be preserved should be
> > stored in HDFS rather than on the stat/notebook hosts, because the home
> > directories there are not backed up. My concern was that people had a
> > different understanding about backups, and I wanted to clarify.
> >
> > We (as the Analytics team) don't have any good way at the moment to
> > periodically scan HDFS and home directories across hosts to find PII data
> > that is retained longer than the allowed period of time. The main reason
> > is that we'd need to find a way to check a huge number of files, with
> > different names and formats, and figure out whether the data contained in
> > them is PII and has been retained for more than X days. It is not an
> > impossible task, but it is not easy or trivial either; in my opinion we'd
> > need a lot more staff to create and maintain something like that :)
> >
> > We recently started cleaning up old home directories (i.e. those belonging
> > to users who are no longer active), and we established a process with SRE
> > to get pinged when a user is offboarded, so that we can verify what data
> > should be kept and what should not (I know that both of you are aware of
> > this, since you have been working with us on several tasks; I am writing
> > it to give other people the context :). This is only a starting point; I
> > really hope to have something more robust and complete in the future.
> >
> > In the meantime, I'd say that every user is responsible for the data that
> > they handle on the Analytics infrastructure, periodically reviewing it and
> > deleting it when necessary. I don't have a specific guideline/process to
> > suggest, but we can definitely have a chat together and decide on
> > something shared among our teams!
> >
> > Let me know if this makes sense or not :)
> >
> > Thanks,
> >
> > Luca
> >
> > On Wed, Jul 10, 2019 at 11:15 PM Nuria Ruiz <[email protected]> wrote:
> >
> > > >I have one question for you: As you allow/encourage for more copies of
> > > >the files to exist
> > > To be extra clear, we do not encourage keeping data on the notebook
> > > hosts at all; they have no capacity to either process or host large
> > > amounts of data. Data that you are working with is best placed in your
> > > /user/your-username directory in Hadoop, so far from encouraging multiple
> > > copies, we are rather encouraging you to keep the data off the notebook
> > > machines.
> > >
> > > Thanks,
> > >
> > > Nuria
> > >
> > > On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman <[email protected]> wrote:
> > >
> > >> I second Leila's question. The issue of how we flag PII data and ensure
> > >> it's appropriately scrubbed came up in our team meeting yesterday. We're
> > >> discussing team practices for data/project backups tomorrow and plan to
> > >> come out with some proposals, at least for the short term.
> > >>
> > >> Are there any existing processes or guidelines I should be aware of?
> > >>
> > >> Thanks!
> > >> Kate
> > >>
> > >> --
> > >>
> > >> Kate Zimmerman (she/they)
> > >> Head of Product Analytics
> > >> Wikimedia Foundation
> > >>
> > >>
> > >> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia <[email protected]> wrote:
> > >>
> > >>> Hi Luca,
> > >>>
> > >>> Thanks for the heads up. Isaac is coordinating a response from the
> > >>> Research side.
> > >>>
> > >>> I have one question for you: As you allow/encourage for more copies of
> > >>> the files to exist, what is the mechanism you'd like to put in place
> > >>> for reducing the chances of PII being copied into new folders that will
> > >>> then be even harder (for your team) to keep track of? Having an
> > >>> explicit process/understanding about this will be very helpful.
> > >>>
> > >>> Thanks,
> > >>> Leila
> > >>>
> > >>>
> > >>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano <[email protected]>
> > >>> wrote:
> > >>> >
> > >>> > Hi everybody,
> > >>> >
> > >>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics
> > >>> > team wanted to reach out to everybody to make it clear that the home
> > >>> > directories on the stat/notebook nodes are not backed up periodically.
> > >>> > They run on a software RAID configuration spanning multiple disks, of
> > >>> > course, so we are resilient to a disk failure, but even if unlikely, it
> > >>> > might happen that a host loses all its data. Please keep this in mind
> > >>> > when working on important projects and/or handling important data that
> > >>> > you care about.
> > >>> >
> > >>> > I just added a warning to
> > >>> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients
> > >>> >
> > >>> > If you have really important data that is too big to back up, keep in
> > >>> > mind that you can use your home directory (/user/your-username) on HDFS
> > >>> > (which replicates data three times across multiple nodes).
> > >>> >
> > >>> > Please let us know if you have comments/suggestions/etc. in the
> > >>> > aforementioned task.
> > >>> >
> > >>> > Thanks in advance!
> > >>> >
> > >>> > Luca (on behalf of the Analytics team)
> > >>> > _______________________________________________
> > >>> > Wiki-research-l mailing list
> > >>> > [email protected]
> > >>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> > >>>
> > >>>
> > >>> _______________________________________________
> > >> Analytics mailing list
> > >> [email protected]
> > >> https://lists.wikimedia.org/mailman/listinfo/analytics
> > >>
>
>
>
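
Luca's message above describes the difficulty of finding files that have
been retained for more than X days. As a rough illustration of the age half
of that problem only, here is a minimal sketch that lists files in a local
directory tree by modification time. It is not the Analytics team's process:
the function name find_stale_files and the 90-day value are hypothetical, it
does not look at HDFS, and it cannot tell whether a file contains PII.

```python
import os
import stat
import time
from pathlib import Path
from typing import Iterator, Optional


def find_stale_files(
    root: str, max_age_days: float, now: Optional[float] = None
) -> Iterator[Path]:
    """Yield regular files under `root` last modified more than `max_age_days` ago.

    Hypothetical helper for illustration: age is judged by mtime only,
    which says nothing about whether the contents are PII.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat: inspect the entry itself, do not follow symlinks
                info = path.lstat()
            except OSError:
                # File vanished or is unreadable; skip it
                continue
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                yield path


if __name__ == "__main__":
    # 90 days is a placeholder, not an actual retention policy.
    for stale in find_stale_files(os.path.expanduser("~"), max_age_days=90):
        print(stale)
```

A user could run something like this on their own home directory as part of
the periodic review Luca suggests, then decide by hand what to delete.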
