Syan Tan wrote:
> i've processed 360,000 rows of clin.clin_narrative and parsed out all
> the words containing letters. I was thinking of using a stoplist
> method where any word appearing on the stoplist will be replaced by
> 'xxxx'. The stoplist would also include all the names listed out from
> dem.names.lastnames and dem.names.firstnames.
>
> BTW - what about a secondary structure for clin.clin_narrative, where
> the narrative consists of a list of indexes pointing into a table of
> words. this is the simplest step before having some sort of semantic
> linking at the word level (but not at the phrase level).
>
> whilst trying to recreate the gnumed database using a pg_dump, the
> dump reload seems to stall; I tried to turn off logging, table
> constraints, removing internal log table data, and fsync, which all
> finally worked, but I'm not sure what causes the stall.
>
> *On Mon Apr 24 18:53, Karsten Hilbert sent:*
> > On Thu, Apr 20, 2006 at 09:47:54AM +0800, Syan Tan wrote:
> > > thinking about it, the only correct thing to do seems to be to
> > > preserve the structure of the instance data and the health issue
> > > + episode headings, but to scramble the text with word
> > > substitution, as well as name substitution, date fudging, and
> > > address random relinking. would that be de-identified enough?
> >
> > Well, I tend to think that "de-identified enough" is a range
> > from "acceptably so" to "beyond use" rather than a cutoff.
> > The exact value used within that range depends on what sort
> > of protection you need.
> >
> > Yes, if you want to hide a patient's data securely from your
> > fellow doctor next door you will have to scramble the medical
> > content, too, as she might be able to match "real patient"
> > to "problems/operations listed" by her own medical skills
> > and thereby gain knowledge via the now re-identified EMR.
> >
> > But if you want to protect a patient's privacy from, say,
> > me, it's enough to falsify the identities. I do not have
> > access to your patients. I also have no idea how to find out
> > who your patients actually are in order to start matching
> > EMRs to patients. Hence proper protection is ensured, I dare
> > say. It is akin to not storing patient names with any
> > medical data and holding the EMR ID <-> patient identity
> > mapping elsewhere in a secure space (say, the patient's
> > brain).
> >
> > In a recent discussion on the openhealth list this topic was
> > chanced upon and the OpenEHR guys thought the latter
> > approach would be the most secure that's practically useful
> > - and they were talking real live patient data in actual
> > care.

I didn't mention it on the openEHR list (maybe I should), but merely
removing the direct identifiers (names, DOB etc.) does not de-identify
or anonymise the data. For example, if the record reveals "32 yr old
male, with medical visits on 23/4/04, 12/6/05 and 14/01/06" then that
record has a very high probability of being unique to an individual,
even in a large population. Hence if I know your age and sex (easily
discovered or ascertained) and I know that you had medical appointments
on those dates (e.g. if I had access to your work leave records, as
staff in the personnel department of your employer may have), then I
can fairly easily work out which record belongs to you.

Disclosure control in microdata almost always involves some degree of
obfuscation, perturbation or allocation to broad categories. In other
words, a lot of detail needs to be removed to make real data truly
anonymous (in that it cannot be re-identified).

Also, anonymity of data is a continuum, not a dichotomy. Often it comes
down to a risk judgement and some assumptions about what additional
information an 'attacker' who might try to re-identify records might
possess. If the data are to be made publicly available, you can't make
any assumptions about what an attacker might or might not already know
about a person, so you need to be very conservative.

Tim C
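
P.S. To make the uniqueness point concrete, here is a rough Python sketch. The records are invented for illustration (nothing here comes from GNUmed data), and `exact_key`, `coarse_key` and `count_unique` are hypothetical helper names. It counts how many records are the only one carrying their combination of age, sex and visit dates, first with full detail and then after allocating to broad categories (ten-year age bands, visit years only).

```python
from collections import Counter
from datetime import date

# Invented example records with direct identifiers already removed:
# (age, sex, visit dates). The first one mirrors the example above.
records = [
    (32, "M", (date(2004, 4, 23), date(2005, 6, 12), date(2006, 1, 14))),
    (32, "M", (date(2004, 4, 20), date(2005, 6, 30), date(2006, 1, 3))),
    (34, "M", (date(2004, 5, 2), date(2005, 4, 11), date(2006, 2, 27))),
    (58, "F", (date(2004, 9, 9), date(2005, 9, 15))),
    (55, "F", (date(2004, 8, 1), date(2005, 7, 20))),
]


def exact_key(record):
    """Full detail: exact age, sex and exact visit dates."""
    return record


def coarse_key(record):
    """Broad categories: ten-year age band, sex, set of visit years."""
    age, sex, visits = record
    age_band = age // 10 * 10
    visit_years = tuple(sorted({d.year for d in visits}))
    return (age_band, sex, visit_years)


def count_unique(records, key):
    """Number of records whose key is shared with no other record."""
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1)


print(count_unique(records, exact_key))   # 5: every record is unique
print(count_unique(records, coarse_key))  # 0: each record has a look-alike
```

With exact dates every record can be singled out by anyone who knows the age, sex and appointment dates; after coarsening, none can in this toy set. A real data set would of course need a much more careful assessment, and coarsening alone does not guarantee anonymity.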
