Thanks Jodi, Alec, Joss, Robert and Nick. That's really helpful - and very quick! A lot to digest - but it looks pretty devastating for the piece, confirming my suspicions.
Many thanks again

Paul

Dr Paul Bernal
Lecturer
UEA Law School
University of East Anglia
Norwich Research Park
Norwich NR4 7TJ
email: [email protected]
Web: http://www.paulbernal.co.uk/
Blog: http://paulbernal.wordpress.com/
Twitter: @paulbernalUK

On 10 Sep 2012, at 03:20, Robert Munro <[email protected]> wrote:

I second the criticism about the assumptions of a 'perfect population register'. This is a much broader problem, as shown by the Netflix case. For a good synopsis, see Pete Warden's take on the problem, some examples of how external data can be used to help reverse anonymized data, and some suggestions for ways to operate with imperfect anonymization: http://strata.oreilly.com/2011/05/anonymize-data-limits.html

You certainly don't need to be high-profile, either, as the article suggests. Last year I was working on disease outbreak tracking. There was an actual case where a girl in East Africa had been reported as testing positive for Ebola. Her village was named in reports, and this was a region where victims of diseases are often vilified and sometimes killed. She would likely have been the only person from her village who was rushed to a hospital at that time (and more likely still the only girl of her age-bracket). It would have been simple for everyone from her village to immediately make the connection. We decided we would not want to publish this information, even though many other health organizations did. Her diagnosis was ultimately incorrect, which doesn't really affect the anonymization issue, but it makes any identification/vilification even more disturbing.

We were information managers and health professionals, not lawyers, and the international aspect no doubt complicates things. I assume that the health organizations who did publicize this acted within the law. For us, that wasn't enough. If it had been reported in a health journal 5 years later? That might be ok. But as a real-time report it was clearly unethical. I doubt the other organizations published this out of malice - it was one piece of information among many - but it highlights the problem.

Rob

On 9 September 2012 15:30, Joss Wright <[email protected]> wrote:

On Sun, Sep 09, 2012 at 07:19:22PM +0000, Paul Bernal (LAW) wrote:
> I wondered if anyone had an opinion on it - I don't have the technical knowledge to be able to evaluate it properly. The basic conclusion seems to be that re-identification of 'anonymised' data is not nearly as easy as we had previously thought (from the work of Latanya Sweeney, Paul Ohm etc). Are these conclusions valid?
>
> My concern is that I can see this paper being used to justify all kinds of potentially risky information being released - particularly health data, which could get into the hands of insurance companies and others who could use it to the detriment of individuals. On the other hand, if the conclusions are really valid, then perhaps people like me shouldn't be as concerned as we are.

Hi Paul,

I've gone over this paper quite quickly, partially because it's late here and I should be asleep; apologies for any bizarre turns of phrase, repetition (hesitation or deviation...), or bad-tempered comments. :) I'll also certainly defer to the hardcore reidentification experts if they turn up.

(This email has become slightly longer than I intended. To sum up: "Lots of problems. False assumptions. Cherry-picked examples. Ignores or wholly misunderstands subsequent decade of research. Somewhat misrepresents statistics. Wishful-thinking recommendations. Correct in stating that we don't need to delete all data everywhere in order to avoid reidentification, but that's about it.")
My initial response is that the paper is partially correct, in that the Sweeney example was a dramatic, anecdotal demonstration of reidentification and shouldn't be taken as representative of data in general. On the other hand, the paper goes wildly off in the other direction, and claims that the specifics of the Sweeney example somehow demonstrate that reidentification in general is barely feasible and can easily be handled with a few simple rules of thumb. Overall, I would say that there are a number of serious flaws in the arguments of the author.

Firstly, the paper is predicated almost entirely on what the author refers to as `the myth of the perfect population register' -- the observation that almost no realistic database covers an entire population, and so any apparently unique record could in fact also match someone outside of the database. This is certainly true, but is used by the author to justify an assumption that does not hold, in my opinion. This assumption, the largest conceptual flaw in the paper, is that a reidentification has to be unique and perfect to be of any value.

The author claims, based on the `perfect population register', that because some reidentified record, relating to, say, the health information of an individual, could potentially match someone who wasn't in the database, there is no guarantee that the record is accurate, and thus that the reidentification is useless. This is not true -- even such partial or probabilistic reidentifications reduce the set of possibilities, and reveal information regarding an individual. This can be used and combined with further data sources to achieve either reidentification, if that is the goal, or simply the revelation of sensitive personal information.

As an example: Sweeney used William Weld's unique characteristics in the voter database to reidentify his anonymous health data. As some hypothetical `Person X' who was not in the voter database could have matched those apparently unique characteristics, the anonymous health data could have belonged to Person X rather than William Weld. As the author notes, this is overcome in the Sweeney case by making use of public information to confirm that the data was that of William Weld -- the author seems to believe that any such auxiliary information for other individuals could not reasonably exist, despite the existence of Google and Facebook.

The author takes from this that any partial or probabilistic reidentification is therefore worthless, and claims that it was only the widely publicized `auxiliary information' about William Weld's health status that made such reidentification possible. What the author fails to address is that exactly this kind of auxiliary information is being made available with greater and greater frequency by the release of poorly-anonymised databases. As such, whilst the initial reidentification cannot be made with perfect accuracy, subsequent pieces of auxiliary information can be used to verify, research and identify an individual. (Of course, an attacker may simply be seeking to gain a given piece of sensitive information, so a true `reidentification' may not be a useful goal in considering the risks of such databases.)
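To make that concrete, here is a minimal, purely illustrative sketch of what a linkage attack amounts to; the records and field names are invented toy data, not anyone's real information. You simply join the `anonymised' release to a public register on whatever quasi-identifiers they share, and look at how small the candidate set becomes.

    # Illustrative linkage attack: join an 'anonymised' release to a public
    # register on shared quasi-identifiers. All data here is made up.
    from collections import defaultdict

    # 'Anonymised' health records: names removed, quasi-identifiers kept.
    health = [
        {"zip": "02138", "dob": "1950-06-15", "sex": "M", "diagnosis": "X"},
        {"zip": "02139", "dob": "1962-01-15", "sex": "F", "diagnosis": "Y"},
    ]

    # Public register (e.g. a voter roll) with names attached to the same fields.
    register = [
        {"name": "Person A", "zip": "02138", "dob": "1950-06-15", "sex": "M"},
        {"name": "Person B", "zip": "02139", "dob": "1962-01-15", "sex": "F"},
        {"name": "Person C", "zip": "02139", "dob": "1962-01-15", "sex": "F"},
    ]

    QUASI = ("zip", "dob", "sex")

    def key(record):
        return tuple(record[q] for q in QUASI)

    # Index the register by quasi-identifier combination.
    candidates = defaultdict(list)
    for person in register:
        candidates[key(person)].append(person["name"])

    # Each 'anonymous' record now maps to a (usually tiny) candidate set.
    for record in health:
        print(record["diagnosis"], "->", candidates[key(record)])
        # X -> ['Person A']               (unique match)
        # Y -> ['Person B', 'Person C']   (two candidates: still a leak)

The second record is exactly the point above: even when the match is not unique, or the true subject might be missing from the register, the attacker is left with a short list, and any further scrap of auxiliary information (a news report, a Facebook post) can finish the job.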
The author states in the abstract that `... most re-identification attempts face a strong challenge in being able to create a complete and accurate population register', and claims that this strong assumption underlies most other reidentification work. (Using the entirely objective phrase `somewhat furtive "insider" trade secret'.) In fact, this strong assumption is entirely too strong, and is given as an assumption only by the author themselves. I would point to the seminal Shmatikov and Narayanan work on the Netflix Prize for a deeper analysis that shatters exactly this kind of assumption. This claim by the author is something of a strawman argument, and one on which the entire paper is based.

A second flaw comes in switching several times, according to the argument needed, as to whether the attacker is interested in identifying a targeted individual (`We need William Weld's data') or whether any individual will do (`We need someone's data, but don't care whose it is'). These raise very different problems, and different sets of statistics, and need to be clearly separated in the analysis.

A third flaw, related to the first and epitomised by the section starting with the final paragraph of page 6, is the claim that an attacker would need to somehow build their perfect database before reidentifying an individual. The author states that the attacker would have to check all other individuals outside of the original database to complete the reidentification. In fact, they could simply seek alternative forms of auxiliary information to make their reidentification more and more certain. I do find it bizarre that the author makes this claim, as the more intelligent approach of using auxiliary information is precisely that employed by Sweeney in the case of William Weld.

The author does address the problem of probabilistic reidentification in the latter stages of the paper (top of page 9), but dismisses it entirely, and unreasonably, out of hand. I could write a whole essay on this particular argument, but I'll simply note that with a 35% chance of error you still have a very good starting point from which to find extra auxiliary information and reduce your error to whatever you decide is acceptable. (This should not be ignored, however, as the author's insistence that reidentification must be 100% certain is probably the deepest flaw here.)

A more worrying problem comes in the surprising lack of coverage of any of the subsequent, and equally highly publicized, reidentification attacks, or of any of the developments in anonymisation since k-anonymity. Even if we brush aside the vast amount of work on differential privacy, which is extremely popular in anonymity research today, the author has not addressed concepts such as l-diversity or t-closeness, which would seem necessary for a reasonable study.

(As a quick example, consider this application of an l-diversity problem: we cannot identify William Weld uniquely in the health database, but we can isolate him as one of four people. All of those four have been prescribed antidepressants in the last six months, and three are being treated for an STD. No perfect reidentification, but certainly a sensitive data leak for the poor governor.)
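For illustration only (the records and field names below are invented), that kind of leak is trivial to detect: for each group of records sharing the same quasi-identifiers, count how many distinct sensitive values the group actually contains.

    # Illustrative l-diversity check on toy data: a group can be 4-anonymous
    # and still leak its sensitive attribute if the group lacks diversity.
    from collections import defaultdict

    records = [
        {"zip": "021**", "age": "60-70", "prescription": "antidepressant"},
        {"zip": "021**", "age": "60-70", "prescription": "antidepressant"},
        {"zip": "021**", "age": "60-70", "prescription": "antidepressant"},
        {"zip": "021**", "age": "60-70", "prescription": "antidepressant"},
    ]

    # Group records by their (generalised) quasi-identifiers.
    groups = defaultdict(list)
    for r in records:
        groups[(r["zip"], r["age"])].append(r["prescription"])

    for quasi, sensitive in groups.items():
        k = len(sensitive)       # size of the group: 4, so '4-anonymous'
        l = len(set(sensitive))  # distinct sensitive values in the group: 1
        print(quasi, "k =", k, "l =", l)
    # ('021**', '60-70') k = 4 l = 1
    # Knowing someone is in this group reveals their prescription with
    # certainty, even though no individual record is uniquely reidentified.

The k-anonymity box is ticked; the sensitive attribute leaks anyway.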
The total lack of coverage of, for example, Shmatikov and Narayanan's reidentification of the Netflix Prize dataset, and of the (wonderful) analysis and methodology used there, shows a worrying lack of familiarity with the state of the art, and certainly calls into question the conclusions drawn from the author's analysis.

I do find the total focus on the Sweeney example, and the picking apart of its details, a very concerning example of the kind of thinking that often surrounds anonymisation: the belief that by fixing the specific problem that you identify with a specific example, you can fix the wider problem. This is a `patching up the holes' approach, rather than an attempt to fix a problem systemically; it has rarely been shown to be an effective strategy, particularly in computer security. ("This was caused by a combination of gender, birthdate and zip code? Quick, make those sensitive pieces of data!")

The recommendations at the end of the paper are simply unrealistic. Point by point:

1) Make it illegal to reidentify data -- this approach has been criticised at length in the literature, as the author acknowledges and dismisses, but I would focus particularly here on how difficult it is to detect reidentification attempts. This will stop only the most ethical of attackers.

2) Require anyone linking in new data to maintain anonymity -- this recognizes the problem of auxiliary information, but somehow ignores it at the same time.

3) Give data `anonymous' status, but allow that status to be withdrawn -- I assume that all the copies of the dataset will automatically self-destruct once this status is withdrawn.

4) Specify that recipients must comply with restrictions -- if you can enforce this then you have already solved most of the world's problems. More seriously, this (and other recommendations here) seems to conflate anonymised data that is shared with trusted researchers, which /is/ less of a problem, with anonymised data that is released to the public. If you are restricting access, there are a lot of extra approaches that you can employ. This distinction is extremely important to understand, as each public release of data combines with the others to provide more and more auxiliary data. This is why it is critically important that data for public release is properly anonymised, as there is no realistic way to pull that data back once it is in the public domain. All information is auxiliary information for the next attack.

5) Require that data holders are secure -- again, this is a fine wish, but gives nothing practical.

6) Data use agreements that pass on to further recipients -- trust is not transitive, and this carries most of the same wishful-thinking problems as the other recommendations here.

All of these recommendations are based on an assumption of trust, good faith and playing by the rules. In short, entirely the opposite of conventional security-based thinking. While we shouldn't throw away everything to meet some puritanical ideal of security, we shouldn't ignore an entire field of study because we don't like its conclusions.

I don't entirely dismiss the need for a regulatory approach to this; in fact, several of these recommendations are reasonable if combined with other, stronger guarantees. There should be penalties for misuse of data, or for poor anonymisation, but they should be backed up at the technical level by effective techniques that can safeguard information. More importantly, none of these recommendations provide any kind of practical or constructive approach to best practice for anonymising data, or to weighing up the risks or effects of data release. This seems to follow the overall tone of the paper: that these risks are not a concern.
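To be clear about what `practical and constructive' could even mean here, a minimal release gate is easy to sketch. The fields, data and threshold below are entirely hypothetical: generalise the quasi-identifiers and refuse to publish until the smallest resulting group reaches a chosen size. Even that, as the l-diversity example above shows, is only a starting point, but it is at least a checkable criterion.

    # Minimal sketch of a generalise-then-check release gate. The fields,
    # records and threshold are hypothetical; this is a baseline, not a fix.
    from collections import Counter

    K = 5  # minimum acceptable group size: a policy choice, not a magic number

    def generalise(record):
        """Coarsen quasi-identifiers: ZIP to a 3-digit prefix, birth year to decade."""
        return (record["zip"][:3] + "**",
                record["birth_year"] // 10 * 10,
                record["sex"])

    def smallest_group(records):
        return min(Counter(generalise(r) for r in records).values())

    records = [
        {"zip": "02138", "birth_year": 1945, "sex": "M"},
        {"zip": "02139", "birth_year": 1947, "sex": "M"},
        {"zip": "02139", "birth_year": 1941, "sex": "M"},
    ]

    print(smallest_group(records))       # 3
    print(smallest_group(records) >= K)  # False: do not release as-is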
The final conclusion of the paper is that the Sweeney example was not representative, and with that I agree; I nonetheless wholly disagree with almost all of the analysis and conclusions of the paper.

From the choice use of language regarding, in particular, the `somewhat furtive "insider" trade secret', the author clearly believes that researchers into reidentification are massively and knowingly overplaying the chances of reidentification. I resent that.

The one point on which I do agree is that there needs to be a balance between the benefits of access to large-scale databases and the risks of reidentification. Where that point of balance should be is, I think, something on which I would strongly disagree with the author; although perhaps not by as much as one might think.

I do fully appreciate that the author comes from the perspective of wanting to use data for the greater good, and that some claims of the risks of database release are overly cautious. This paper, though, massively overstates the difficulties of reidentification, and massively understates the risks. We should have a better understanding of the actual risks of reidentification, and weigh this against the benefits of access to aggregate personal data. The way to do this, however, is through a broad-based study of the real-world risks, research into the means for reidentification and anonymisation, and a systemic approach to the protection of personal data; not by hand-waving away the risks by picking apart one unrepresentative example and ignoring the subsequent decade of active research into the area.

Happy to answer any other questions, on- or off-list.

Joss

--
Idibon
www.idibon.com
www.robertmunro.com
-- Unsubscribe, change to digest, or change password at: https://mailman.stanford.edu/mailman/listinfo/liberationtech
