That sounds like it would be perfect for this task.

On Monday, September 29, 2014, Peter Szolovits <[email protected]> wrote:
> I have a set of about 27K documents from MIMIC (circa 2009) in which I
> have replaced the weird PHI markers by synthesized pseudonymous data.
> These have natural sentence breaks (typically in the middle of lines),
> normal paragraph structure, bulleted lists, etc. Assuming it goes to
> people who have signed the MIMIC DUA, I could provide these if you are
> interested. --Pete Sz.
>
> On Sep 29, 2014, at 1:37 PM, Miller, Timothy <[email protected]> wrote:
>
> > Some of them are a bit artificial for this task, with notes being
> > annotated as one sentence per line and offset punctuation. I think maybe
> > the 2008 and 2009 data might have original formatting though, with
> > newlines not always breaking sentences. That has certain advantages over
> > raw MIMIC for training since the PHI isn't so weirdly formatted, but
> > then again is not a mix of styles (that is, the styles of newline always
> > terminates sentence vs. sometimes terminates sentence). I think it would
> > still have to be paired with another dataset to be a representative
> > sample.
> > Tim
> >
> > On 09/29/2014 01:24 PM, vijay garla wrote:
> >> Why not use the i2b2 corpora?
> >>
> >> On Monday, September 29, 2014, Dligach, Dmitriy <[email protected]> wrote:
> >>
> >>> Maybe creating a made-up set of sentences would be an option? That
> >>> way we could agree on the annotation of concrete cases. Although this
> >>> would be more of a unit test than a corpus.
> >>>
> >>> Dima
> >>>
> >>> On Sep 27, 2014, at 12:15, Miller, Timothy <[email protected]> wrote:
> >>>
> >>>> I've just been using the opennlp command line cross validator on the
> >>>> small dataset i annotated (along with some eyeballing). It would be
> >>>> cool if there was a standard clinical resource available for this
> >>>> task, but I hadn't considered it much because the data I annotated
> >>>> pulls from multiple datasets and the process of arranging with
> >>>> different institutions to make something like that available would
> >>>> probably be a nightmare.
> >>>> Tim
> >>>>
> >>>> Sent from my iPad. Sorry about the typos.
> >>>>
> >>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <[email protected]> wrote:
> >>>>> Tim, thanks for working on this!
> >>>>>
> >>>>> Question: do we have some formal way of evaluating the sentence
> >>>>> detector? Maybe we should come up with some dev set that would
> >>>>> include examples from mimic...
> >>>>> Dima
> >>>>>
> >>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <[email protected]> wrote:
> >>>>>> I have been working on the sentence detector newline issue,
> >>>>>> training a model to probabilistically split sentences on newlines
> >>>>>> rather than forcing sentence breaks. I have checked in a model to
> >>>>>> the repo under ctakes-core-res. I also attached a patch to
> >>>>>> ctakes-core to the jira issue:
> >>>>>> https://issues.apache.org/jira/browse/CTAKES-41
> >>>>>>
> >>>>>> for people to test. The status of my testing is that it doesn't
> >>>>>> seem to break on notes where ctakes worked well before (those where
> >>>>>> newlines are always sentence breaks), and is a slight improvement
> >>>>>> on notes where newlines may or may not be sentence breaks. Once the
> >>>>>> change is checked in we can continue improving the model by adding
> >>>>>> more data and features, but the first hurdle I'd like to get past
> >>>>>> is making sure it runs well enough on the type of data that the old
> >>>>>> model worked well on. Let me know if you have any questions.
> >>>>>> Thanks
> >>>>>> Tim

--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical Association
[email protected]
gchat: [email protected]
linkedin: www.linkedin.com/in/ksarma
