PLEASE CONTACT jon.gib...@manchester.ac.uk for further details ESNW/MRCCS Seminar Announcement
"A new algorithm for finding minimal sample uniques for use in statistical disclosure assessment" Friday, 7th October 2005, 2-3p.m. BST Venue: Room 1.10 (ESNW Access Grid) Kilburn Building http://www.esnw.ac.uk/n-events.shtml --- Dr. Anna Manning The Centre for Novel Computing Department of Computer Science University of Manchester Improvements and innovations in computer processing power, disk storage and networks have led to dramatic increases in the ability to accumulate and analyze personal data. However, if personal data is made available, even in an anonymized form, there is a risk of individuals being identified using statistical disclosure through the matching of known information with the anonymized data, resulting in material specific to those individuals being revealed. This work focuses on the identification of individual records with a high risk of disclosure, a process otherwise known as Statistical Disclosure Assessment. The records belonging to certain individuals have a significant chance of being identified as their contents, or attributes, are unique and therefore have the potential to be matched directly with details (including names and addresses) from another dataset. An illustration of a `risky' record of this type is a sixteen-year-old widow in a population survey. The ability to comprehensively locate and grade such records leads to more efficient Statistical Disclosure Control (SDC) of released data. A sequential algorithm, SUDA (Special Unique Detection Algorithm), has been designed and implemented specifically for this problem and is currently used by the Office for National Statistics (ONS) in London. Only unique attribute sets without any unique subsets --- Minimal Sample Uniques (MSUs) --- are considered in order to avoid the use of redundant information and to keep the classification process as focused as possible. Although SUDA has greatly increased the depth of risk assessment possible, the demanding levels of execution time required to find all MSUs mean that it is restricted to small datasets, particularly in terms of the number of columns that they possess. This talk presents SUDA2, a recursive algorithm for finding MSUs. SUDA2 uses a novel method for representing the search space for MSUs and new observations about the properties of MSUs to prune and traverse this space. Experimental comparisons with SUDA demonstrate that SUDA2 is not only several orders of magnitude faster but is also capable of identifying the boundaries of the search space, enabling datasets of larger numbers of columns than before to be addressed.