That does sound like it would be useful, since MIMIC does have both kinds of linebreak styles in different notes. If I did some annotations on such a dataset, would it be redistributable, say on the PhysioNet website? I believe the ShARe project has a download site there (it is a layer of annotations on MIMIC). Another option would be for you to post your raw data there, and I could post offset-based annotations on a public repo like GitHub.
Tim

On 09/29/2014 01:54 PM, Peter Szolovits wrote:
> I have a set of about 27K documents from MIMIC (circa 2009) in which I have
> replaced the weird PHI markers by synthesized pseudonymous data. These have
> natural sentence breaks (typically in the middle of lines), normal paragraph
> structure, bulleted lists, etc. Assuming it goes to people who have signed
> the MIMIC DUA, I could provide these if you are interested. --Pete Sz.
>
> On Sep 29, 2014, at 1:37 PM, Miller, Timothy
> <[email protected]> wrote:
>
>> Some of them are a bit artificial for this task, with notes being
>> annotated as one sentence per line and offset punctuation. I think maybe
>> the 2008 and 2009 data might have original formatting though, with
>> newlines not always breaking sentences. That has certain advantages over
>> raw MIMIC for training since the PHI isn't so weirdly formatted, but
>> then again is not a mix of styles (that is, the styles of newline always
>> terminates sentence vs. sometimes terminates sentence). I think it would
>> still have to be paired with another dataset to be a representative sample.
>> Tim
>>
>> On 09/29/2014 01:24 PM, vijay garla wrote:
>>> Why not use the i2b2 corpora?
>>>
>>> On Monday, September 29, 2014, Dligach, Dmitriy
>>> <[email protected]> wrote:
>>>
>>>> Maybe creating a made-up set of sentences would be an option? That way we
>>>> could agree on the annotation of concrete cases. Although this would be
>>>> more of a unit test than a corpus.
>>>>
>>>> Dima
>>>>
>>>> On Sep 27, 2014, at 12:15, Miller, Timothy
>>>> <[email protected]> wrote:
>>>>
>>>>> I've just been using the opennlp command line cross validator on the
>>>>> small dataset i annotated (along with some eyeballing). It would be cool
>>>>> if there was a standard clinical resource available for this task, but I
>>>>> hadn't considered it much because the data I annotated pulls from
>>>>> multiple datasets and the process of arranging with different
>>>>> institutions to make something like that available would probably be a
>>>>> nightmare.
>>>>> Tim
>>>>>
>>>>> Sent from my iPad. Sorry about the typos.
>>>>>
>>>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy"
>>>>>> <[email protected]> wrote:
>>>>>> Tim, thanks for working on this!
>>>>>>
>>>>>> Question: do we have some formal way of evaluating the sentence
>>>>>> detector? Maybe we should come up with some dev set that would include
>>>>>> examples from mimic...
>>>>>> Dima
>>>>>>
>>>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy
>>>>>>> <[email protected]> wrote:
>>>>>>> I have been working on the sentence detector newline issue, training
>>>>>>> a model to probabilistically split sentences on newlines rather than
>>>>>>> forcing sentence breaks. I have checked in a model to the repo under
>>>>>>> ctakes-core-res. I also attached a patch to ctakes-core to the jira
>>>>>>> issue:
>>>>>>> https://issues.apache.org/jira/browse/CTAKES-41
>>>>>>>
>>>>>>> for people to test. The status of my testing is that it doesn't seem
>>>>>>> to break on notes where ctakes worked well before (those where
>>>>>>> newlines are always sentence breaks), and is a slight improvement on
>>>>>>> notes where newlines may or may not be sentence breaks. Once the
>>>>>>> change is checked in we can continue improving the model by adding
>>>>>>> more data and features, but the first hurdle I'd like to get past is
>>>>>>> making sure it runs well enough on the type of data that the old model
>>>>>>> worked well on. Let me know if you have any questions.
>>>>>>> Thanks
>>>>>>> Tim
