+Connie Chen <[email protected]>

Product Analytics has started to discuss backup practices on our team, and Connie Chen, our new data analyst, is documenting the outcomes of those discussions.
Kate

--
Kate Zimmerman (she/they)
Head of Product Analytics
Wikimedia Foundation

On Mon, Jul 15, 2019 at 4:36 PM Leila Zia <[email protected]> wrote:

> All clear, Luca and Nuria. Thanks!
>
> On Thu, Jul 11, 2019 at 2:55 AM Luca Toscano <[email protected]> wrote:
> >
> > Hi Leila and Kate,
> >
> > adding a few words after Nuria's email to clarify my original intentions.
> > My point was that any important and vital file that needs to be preserved
> > may be stored in HDFS rather than on stat/notebooks due to the absence of
> > backups of the home directories. My concern was that people had a
> > different understanding about backups and I wanted to clarify.
> > We (as Analytics team) don't have any good way at the moment to
> > periodically scan HDFS and home directories across hosts to find PII data
> > that is retained more than the allowed period of time. The main
> > motivation is that we'd need to find a way to check a huge amount of
> > files, with different names and formats, and figure out if the data
> > contained in them is PII and retained more than X days. It is not an
> > impossible task but not easy or trivial, we'd need a lot more staff in my
> > opinion to create and maintain something similar :) We started recently
> > with the clean up of old home directories (i.e. belonging to users not
> > active anymore) and we established a process with SRE to get pinged when
> > a user is offboarded to verify what data should be kept and what not (I
> > know that both of you are aware of this since you have been working with
> > us on several tasks, I am writing it to allow other people to get the
> > context :). This is only a starting point, I really hope to have
> > something more robust and complete in the future. In the meantime, I'd
> > say that every user is responsible for the data that he/she handles on
> > the Analytics infrastructure, periodically reviewing it and deleting
> > when necessary.
> > I don't have a specific guideline/process to suggest, but we can
> > definitely have a chat together and decide something shared among our
> > teams!
> >
> > Let me know if this makes sense or not :)
> >
> > Thanks,
> >
> > Luca
> >
> > On Wed, Jul 10, 2019 at 23:15, Nuria Ruiz <[email protected]> wrote:
> >
> > > >I have one question for you: As you allow/encourage for more copies of
> > > >the files to exist
> > > To be extra clear, we do not encourage keeping data on the notebook
> > > hosts at all; they have no capacity to either process or host large
> > > amounts of data. Data that you are working with is best placed on
> > > /user/your-username database in hadoop, so far from encouraging
> > > multiple copies we are rather encouraging you to keep the data outside
> > > the notebook machines.
> > >
> > > Thanks,
> > >
> > > Nuria
> > >
> > > On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman <[email protected]>
> > > wrote:
> > >
> > >> I second Leila's question. The issue of how we flag PII data and
> > >> ensure it's appropriately scrubbed came up in our team meeting
> > >> yesterday. We're discussing team practices for data/project backups
> > >> tomorrow and plan to come out with some proposals, at least for the
> > >> short term.
> > >>
> > >> Are there any existing processes or guidelines I should be aware of?
> > >>
> > >> Thanks!
> > >> Kate
> > >>
> > >> --
> > >>
> > >> Kate Zimmerman (she/they)
> > >> Head of Product Analytics
> > >> Wikimedia Foundation
> > >>
> > >>
> > >> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia <[email protected]> wrote:
> > >>
> > >>> Hi Luca,
> > >>>
> > >>> Thanks for the heads up. Isaac is coordinating a response from the
> > >>> Research side.
> > >>>
> > >>> I have one question for you: As you allow/encourage for more copies of
> > >>> the files to exist, what is the mechanism you'd like to put in place
> > >>> for reducing the chances of PII to be copied in new folders that then
> > >>> will be even harder (for your team) to keep track of? Having an
> > >>> explicit process/understanding about this will be very helpful.
> > >>>
> > >>> Thanks,
> > >>> Leila
> > >>>
> > >>>
> > >>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano <[email protected]> wrote:
> > >>> >
> > >>> > Hi everybody,
> > >>> >
> > >>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics
> > >>> > team thought to reach out to everybody to make it clear that all the
> > >>> > home directories on the stat/notebook nodes are not backed up
> > >>> > periodically. They run on a software RAID configuration spanning
> > >>> > multiple disks of course, so we are resilient to a disk failure, but
> > >>> > even if unlikely it might happen that a host could lose all its
> > >>> > data. Please keep this in mind when working on important projects
> > >>> > and/or handling important data that you care about.
> > >>> >
> > >>> > I just added a warning to
> > >>> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
> > >>> > If you have really important data that is too big to backup, keep in
> > >>> > mind that you can use your home directory (/user/your-username) on
> > >>> > HDFS (that replicates data three times across multiple nodes).
> > >>> >
> > >>> > Please let us know if you have comments/suggestions/etc. in the
> > >>> > aforementioned task.
> > >>> >
> > >>> > Thanks in advance!
> > >>> >
> > >>> > Luca (on behalf of the Analytics team)
> > >>> > _______________________________________________
> > >>> > Wiki-research-l mailing list
> > >>> > [email protected]
> > >>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> > >>>
> > >> _______________________________________________
> > >> Analytics mailing list
> > >> [email protected]
> > >> https://lists.wikimedia.org/mailman/listinfo/analytics
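As an illustration of the periodic review Luca describes in the thread (each user checking the data they keep on the Analytics hosts and deleting it when necessary), here is a minimal Python sketch that lists files older than a given age so they can be reviewed by hand. The function name find_stale_files and the 90-day figure in the usage comment are hypothetical and do not come from the thread, which only speaks of "X days". The sketch looks at modification times on a local filesystem only; it does not touch HDFS and cannot tell whether a file contains PII.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return sorted paths under `root` last modified more than `max_age_days` ago.

    `now` is a Unix timestamp; it defaults to the current time and exists
    mainly so the cutoff can be pinned when experimenting.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                # File disappeared or is unreadable; skip it.
                continue
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)


# Hypothetical usage: list files in your home directory untouched for 90 days.
# for path in find_stale_files(Path.home(), 90):
#     print(path)
```

The function only reports candidates and never deletes anything, so deciding what to keep, move to HDFS, or remove stays with the person who owns the data.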
