> I have one question for you: As you allow/encourage for more copies of
> the files to exist

To be extra clear, we do not encourage keeping data on the notebook hosts at all; they have no capacity to either process or host large amounts of data. The data you are working with is best placed in your /user/your-username directory in Hadoop, so far from encouraging multiple copies, we are encouraging you to keep the data off the notebook machines.

Thanks,

Nuria

On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman <[email protected]> wrote:

> I second Leila's question. The issue of how we flag PII data and ensure
> it's appropriately scrubbed came up in our team meeting yesterday. We're
> discussing team practices for data/project backups tomorrow and plan to
> come out with some proposals, at least for the short term.
>
> Are there any existing processes or guidelines I should be aware of?
>
> Thanks!
> Kate
>
> --
>
> Kate Zimmerman (she/they)
> Head of Product Analytics
> Wikimedia Foundation
>
>
> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia <[email protected]> wrote:
>
>> Hi Luca,
>>
>> Thanks for the heads up. Isaac is coordinating a response from the
>> Research side.
>>
>> I have one question for you: As you allow/encourage for more copies of
>> the files to exist, what is the mechanism you'd like to put in place
>> for reducing the chances of PII to be copied in new folders that then
>> will be even harder (for your team) to keep track of? Having an
>> explicit process/understanding about this will be very helpful.
>>
>> Thanks,
>> Leila
>>
>>
>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano <[email protected]>
>> wrote:
>> >
>> > Hi everybody,
>> >
>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
>> > thought to reach out to everybody to make it clear that all the home
>> > directories on the stat/notebook nodes are not backed up periodically.
>> > They run on a software RAID configuration spanning multiple disks of
>> > course, so we are resilient to a disk failure, but even if unlikely it
>> > might happen that a host could lose all its data. Please keep this in
>> > mind when working on important projects and/or handling important data
>> > that you care about.
>> >
>> > I just added a warning to
>> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients .
>> >
>> > If you have really important data that is too big to back up, keep in
>> > mind that you can use your home directory (/user/your-username) on HDFS
>> > (that replicates data three times across multiple nodes).
>> >
>> > Please let us know if you have comments/suggestions/etc. in the
>> > aforementioned task.
>> >
>> > Thanks in advance!
>> >
>> > Luca (on behalf of the Analytics team)
>> > _______________________________________________
>> > Wiki-research-l mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
