Britt and I (cTAKES committers) work on exactly this problem. We have used cTAKES to train models for HIPAA de-identification.
In a nutshell: the answer depends on what the IRB considers "de-identified". Hashing is not allowed by any IRB that I am aware of.

On May 18, 2013, at 12:43 PM, Alexander Measure <[email protected]> wrote:

> In my day job I train text classifiers that are useful for a wide variety
> of health surveillance tasks. The data used to train these classifiers
> however cannot be shared because of confidentiality protections. I would
> like to make these trained models available to others just as cTAKES does,
> but I'm not sure how. Can you tell me how cTAKES does it, or point me to
> resources that might be useful?
>
> My models tend to be regularized logistic regression models trained on
> bag-of-words type features. I suspect that I can get some protection by
> hashing everything to a fixed space first, but if there's a different
> well-established approach out there I'd rather use that.
>
> Alex Measure
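
For concreteness, the "hash everything to a fixed space" idea from the quoted message can be sketched roughly as follows. This is only an illustration, not how cTAKES handles it: the bucket count, the function names, and the choice of SHA-256 are all assumptions made for the example. It also shows one weakness of plain unsalted hashing: anyone with a word list can hash it themselves and see which known words fall into a given bucket.

```python
import hashlib
from collections import Counter

N_BUCKETS = 2 ** 10  # hypothetical size of the fixed feature space


def bucket(token: str, n_buckets: int = N_BUCKETS) -> int:
    """Map a token to a bucket index using an unsalted SHA-256 digest."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_features(text: str, n_buckets: int = N_BUCKETS) -> Counter:
    """Bag-of-words counts keyed by bucket index instead of by raw token."""
    return Counter(bucket(tok, n_buckets) for tok in text.lower().split())


def candidate_tokens(bucket_id: int, vocabulary, n_buckets: int = N_BUCKETS):
    """Return the words from a known vocabulary that hash to bucket_id.

    Because the hash is unsalted and deterministic, this lookup works for
    anyone who can guess the vocabulary.
    """
    return [w for w in vocabulary if bucket(w, n_buckets) == bucket_id]


features = hash_features("patient smith fell from ladder")
guesses = candidate_tokens(bucket("smith"), ["jones", "smith", "ladder", "fell"])
```

Here `features` is what a model would be trained on, and `guesses` is what someone holding a list of surnames could recover for a single bucket. Collisions blur the picture somewhat, but they do not remove it.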
