Hi Ted/Jay,

Thanks for suggesting and taking this up. What information will be needed to accomplish what you were thinking? Just thinking aloud here:
1) Test data. I think John Green crafted about 20-30 notes in the data folder. We can use these as a starting point.
2) Code to run through the various components and pipelines?
3) Environments to run through different O/S/hardware, etc.?
4) Create a Gold Standard format (Knowtator and/or Anafora). cTAKES already has existing readers for those. [For ML based examples?]

I think there is a ctakes-regression project that we can probably just overwrite for new regression testing code.

From: Ted Strall [mailto:tstr...@yahoo.com]
Sent: Thursday, July 30, 2015 9:21 AM
To: Chen, Pei; dev@ctakes.apache.org
Subject: Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

How / when can we go about getting started on this?

________________________________
From: "Chen, Pei" <pei.c...@childrens.harvard.edu>
To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>; Ted Strall <tstr...@yahoo.com>
Sent: Friday, July 24, 2015 12:52 PM
Subject: RE: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Ted- Welcome to the community! I think this would be a great enhancement.
Jay- I think the BigTop folks did a lot with the smoke and integration tests... Do you know how they did it? Something we can reuse?
--Pei

-----Original Message-----
From: Ted Strall [mailto:tstr...@yahoo.com.INVALID]
Sent: Friday, July 24, 2015 12:31 PM
To: dev@ctakes.apache.org
Subject: Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

I would be interested in helping to develop / maintain a regression testing framework for that.
I'm new to ctakes (and just recently started stalking the dev mailing list) but I've been a software engineer for 20 years and have done a lot of framework automation stuff that will probably be required. As I write this, I am working on an automated integration test that will run on Jenkins that fires up and loads an h2 database, a solr instance, an in-house indexing pipeline and an in-house search service, indexes 10k documents, and executes and evaluates some canned queries before shutting itself down. I'm also working on an MS in Predictive Analytics and I am interested in applying machine learning and NLP to medical informatics, so I would welcome the chance to get dirty with that side of stuff, also.

From: Jay Vyas <jayunit100.apa...@gmail.com>
To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
Sent: Friday, July 24, 2015 10:44 AM
Subject: Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Yes this is very interesting work.
- If we have access to a large corpus of de-identified records we can regression test the ctakes platform.
- I can help collaborate on a regression testing framework if someone else wants to help maintain it.

> On Jul 24, 2015, at 11:12 AM, Pei Chen <chen...@apache.org> wrote:
>
> Hi,
> Re: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sciencedirect.com_science_article_pii_S1532046415001392&d=BQIFaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=huK2MFkj300qccT8OSuuoYhy_xEYujfPwiAxhPVz5WY&m=IdFJ0ChLqz9-dg435_5Rea2_0EUPNDw0uCUKnNp_N7k&s=DOgavsLa7IIU0rgq8lxDXTb33J8-4zgCWuKzL83CZyw&e=
> This is very interesting work and I think it would be very valuable for the general community. Is this something that you may be interested in contributing/sharing the code with the Apache cTAKES community?
> Thanks,
> Pei