Hi Peter,

I think ctakes-examples is probably a good starting point, at least in terms of Maven modules, etc. I think it would be good if we used uimaFIT style as the primary approach to wiring components together and generated descriptors as a secondary one... Which components are needed is probably best left to whatever gives the best-performing c-deid. The output is an interesting question: I'm not sure whether we should treat this as an independent preprocessing component or as part of a pipeline (in which case we may need to propose a change to the type system, or perhaps an alternative JCas view). You can probably open up that discussion to the dev group as you see fit.
My 2 cents...

On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <peter.klu...@averbis.com> wrote:
> Hi,
>
> Is there a cTAKES project that may serve as an example of how the cTAKES
> community develops or what a project should look like?
> I learned that different people set up UIMA projects in quite different
> ways, and I do not want to get inspired by "some sort of outdated"
> approach in the cTAKES repo.
>
> Are there restrictions or preferences about the preprocessing components
> that should be used and the kind of "output" of the project?
> Components: On which components may the components rely: tokenizer, ...
> parser, ... dict lookup?
> "output": Should the project provide a pipeline or a single AE?
>
> More comments below.
>
> On 03.11.2015 at 16:54, Azad Dehghan wrote:
>>>
>>> Who else plans to provide patches for it? Just to avoid duplicate work
>>> and to coordinate the efforts ...
>>>
>> I would like to help with translating JAPE to RUTA.
>
> You can already go ahead with the UIMA Ruta Workbench if you want, or
> wait until I set up the project with Ruta integration.
>
> If any questions arise, just ask :-)
>
>>
>>> Is there a development dataset which was utilized for the initial
>>> development, and if yes, is it possible to contribute it too?
>>>
>> The data set is unfortunately not publicly available; i2b2
>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
>> sets 12 months after a given challenge; this is done on an individual basis
>> and involves a Data Use Agreement.
>>
>> However, I will be able to conduct and coordinate the validation.
>>
>
> Ok, I'll investigate whether we already have access to the dataset here.
>
>
>>> My first step would be:
>>> - set up a maven project
>>> - set up a development pipeline in a test (with cTAKES components
>>> replacing the previous ANNIE preprocessing)
>>>
>>>
>>
>>> But one item that we need to review is the 3rd party libs jars that
>>> were included to ensure compatibility. I’ll be sure to take a look at
>>> that over the next few weeks.
>>>
>>> --Pei
>>>
>>>
>> @Pei - once ANNIE components are replaced there should not be a need to
>> worry about the 3rd party libs.
>>
>> Also, just a thought: we may want to create an independent component for
>> the Two Pass recognition (TwoPass.java), as this method has proven useful
>> for general NER on longitudinal data and is surely useful independent of the
>> deid component.
>>
>>
>> Cheers,
>> Azad
>>
>