All clear, Luca and Nuria. Thanks!
On Thu, Jul 11, 2019 at 2:55 AM Luca Toscano <[email protected]> wrote:
>
> Hi Leila and Kate,
>
> adding a few words after Nuria's email to clarify my original intentions.
> My point was that any important and vital file that needs to be preserved
> may be stored in HDFS rather than on stat/notebooks due to the absence of
> backups of the home directories. My concern was that people had a different
> understanding about backups and I wanted to clarify.
> We (as Analytics team) don't have any good way at the moment to
> periodically scan HDFS and home directories across hosts to find PII data
> that is retained longer than the allowed period of time. The main reason
> is that we'd need to find a way to check a huge number of files, with
> different names and formats, and figure out if the data contained in them
> is PII and retained more than X days. It is not an impossible task, but it
> is not easy or trivial either; we'd need a lot more staff in my opinion to
> create and maintain something similar :) We recently started cleaning up old
> home directories (i.e. belonging to users not active anymore) and we
> established a process with SRE to get pinged when a user is offboarded to
> verify what data should be kept and what not (I know that both of you are
> aware of this since you have been working with us on several tasks, I am
> writing it to allow other people to get the context :). This is only a
> starting point, I really hope to have something more robust and complete in
> the future. In the meantime, I'd say that every user is responsible for the
> data that he/she handles on the Analytics infrastructure, periodically
> reviewing it and deleting it when necessary. I don't have a specific
> guideline/process to suggest, but we can definitely have a chat together
> and decide something shared among our teams!
>
> Let me know if this makes sense or not :)
>
> Thanks,
>
> Luca
>
> On Wed, Jul 10, 2019 at 11:15 PM Nuria Ruiz <[email protected]>
> wrote:
>
> > >I have one question for you: As you allow/encourage for more copies of
> > >the files to exist
> > To be extra clear, we do not encourage keeping data on the notebook
> > hosts at all; they have no capacity to either process or host
> > large amounts of data. Data that you are working with is best placed in
> > the /user/your-username database in Hadoop, so far from encouraging
> > multiple copies, we are rather encouraging you to keep the data outside
> > the notebook machines.
> >
> > Thanks,
> >
> > Nuria
> >
> > On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman <[email protected]>
> > wrote:
> >
> >> I second Leila's question. The issue of how we flag PII data and ensure
> >> it's appropriately scrubbed came up in our team meeting yesterday. We're
> >> discussing team practices for data/project backups tomorrow and plan to
> >> come out with some proposals, at least for the short term.
> >>
> >> Are there any existing processes or guidelines I should be aware of?
> >>
> >> Thanks!
> >> Kate
> >>
> >> --
> >>
> >> Kate Zimmerman (she/they)
> >> Head of Product Analytics
> >> Wikimedia Foundation
> >>
> >>
> >> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia <[email protected]> wrote:
> >>
> >>> Hi Luca,
> >>>
> >>> Thanks for the heads up. Isaac is coordinating a response from the
> >>> Research side.
> >>>
> >>> I have one question for you: As you allow/encourage for more copies of
> >>> the files to exist, what is the mechanism you'd like to put in place
> >>> for reducing the chances of PII being copied into new folders that will
> >>> then be even harder (for your team) to keep track of? Having an
> >>> explicit process/understanding about this will be very helpful.
> >>>
> >>> Thanks,
> >>> Leila
> >>>
> >>>
> >>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano <[email protected]>
> >>> wrote:
> >>> >
> >>> > Hi everybody,
> >>> >
> >>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
> >>> > thought to reach out to everybody to make it clear that all the home
> >>> > directories on the stat/notebook nodes are not backed up periodically.
> >>> > They run on a software RAID configuration spanning multiple disks of
> >>> > course, so we are resilient to a disk failure, but even if unlikely it
> >>> > might happen that a host could lose all its data. Please keep this in
> >>> > mind when working on important projects and/or handling important data
> >>> > that you care about.
> >>> >
> >>> > I just added a warning to
> >>> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients .
> >>> > If you have really important data that is too big to back up, keep in
> >>> > mind that you can use your home directory (/user/your-username) on HDFS
> >>> > (that replicates data three times across multiple nodes).
> >>> >
> >>> > Please let us know if you have comments/suggestions/etc. in the
> >>> > aforementioned task.
> >>> >
> >>> > Thanks in advance!
> >>> >
> >>> > Luca (on behalf of the Analytics team)
> >>> > _______________________________________________
> >>> > Wiki-research-l mailing list
> >>> > [email protected]
> >>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> >>>
> >> _______________________________________________
> >> Analytics mailing list
> >> [email protected]
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
