Giuseppe Totaro created TIKA-1645:
-------------------------------------

             Summary: Extraction of biomedical information using CTAKESParser
                 Key: TIKA-1645
                 URL: https://issues.apache.org/jira/browse/TIKA-1645
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Giuseppe Totaro


As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], 
[CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] is 
preliminary work toward integrating 
[Apache cTAKES|http://ctakes.apache.org/] into Tika, allowing users to extract 
biomedical information from clinical text.
Essentially, this work includes a wrapper for CAS serializers that dump the 
identified annotations into XML-based formats.

The attached patch adds the CTAKESParser, a new parser that decorates the 
AutoDetectParser and relies on a new version of CTAKESContentHandler, based on 
feedback from [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. 
This parser generates the same output as AutoDetectParser and, in addition, 
metadata containing the clinical annotations identified by cTAKES.
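
To illustrate the decoration described above, here is a minimal, self-contained sketch. It does not use the real Tika or cTAKES APIs: {{SimpleParser}}, {{BaseParser}}, {{AnnotatingParser}} and the {{ctakes:annotation}} metadata key are hypothetical names chosen for illustration, and the term lookup is only a trivial stand-in for a cTAKES AnalysisEngine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DecoratorSketch {

    /** Hypothetical stand-in for a parser: returns text and fills metadata. */
    interface SimpleParser {
        String parse(String input, Map<String, List<String>> metadata);
    }

    /** Plays the role of the wrapped parser (AutoDetectParser in the patch). */
    static class BaseParser implements SimpleParser {
        @Override
        public String parse(String input, Map<String, List<String>> metadata) {
            metadata.computeIfAbsent("Content-Length", k -> new ArrayList<>())
                    .add(String.valueOf(input.length()));
            return input.trim();
        }
    }

    /** Decorator: delegates parsing unchanged, then adds annotation metadata. */
    static class AnnotatingParser implements SimpleParser {
        private final SimpleParser delegate;
        private final List<String> vocabulary; // assumed list of terms to flag

        AnnotatingParser(SimpleParser delegate, List<String> vocabulary) {
            this.delegate = delegate;
            this.vocabulary = vocabulary;
        }

        @Override
        public String parse(String input, Map<String, List<String>> metadata) {
            // The wrapped parser produces its usual output and metadata.
            String text = delegate.parse(input, metadata);
            // The decorator only appends extra metadata entries.
            String lower = text.toLowerCase(Locale.ROOT);
            for (String term : vocabulary) {
                if (lower.contains(term)) {
                    metadata.computeIfAbsent("ctakes:annotation", k -> new ArrayList<>())
                            .add(term);
                }
            }
            return text;
        }
    }

    public static void main(String[] args) {
        SimpleParser parser =
                new AnnotatingParser(new BaseParser(), Arrays.asList("aspirin", "fever"));
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        String text = parser.parse("  Patient reports fever.  ", metadata);
        System.out.println(text);
        System.out.println(metadata);
    }
}
```

The point of the design is that callers keep receiving the output of the wrapped parser, while the annotations travel alongside it as additional metadata entries.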

To run a cTAKES AnalysisEngine through the Tika CTAKESParser, you first need 
to install the latest stable release of cTAKES (3.2.2), following the 
instructions in the 
[User Install Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide].
Then you can launch Tika as follows:
{noformat}
CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig \
  org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
{noformat}
In the example above, {{/path/to/CTAKESConfig}} is the parent directory of the 
file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which contains 
the configuration properties used to build the cTAKES AnalysisEngine; 
{{tika-config.xml}} is a custom Tika configuration file that lists the 
MIME types that CTAKESParser will parse.
Attached are examples of both {{CTAKESConfig.properties}} and 
{{tika-config.xml}} for parsing ISA-Tab files with cTAKES.
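
The directory {{/path/to/CTAKESConfig}} goes on the classpath, rather than the properties file itself, because Java resolves classpath resources by their path relative to a classpath root. A minimal sketch of that lookup, using only the JDK, is shown here; the {{ClasspathConfigSketch}} class and its {{loadConfig}} helper are hypothetical and are not taken from the patch, which may load the file differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ClasspathConfigSketch {

    // Resource path relative to a classpath root, as named in the description.
    static final String RESOURCE =
            "org/apache/tika/parser/ctakes/CTAKESConfig.properties";

    /** Hypothetical helper: loads the properties file through a class loader. */
    static Properties loadConfig(ClassLoader loader) throws IOException {
        try (InputStream in = loader.getResourceAsStream(RESOURCE)) {
            if (in == null) {
                // Happens when no classpath entry contains the relative path.
                throw new IOException("Not found on classpath: " + RESOURCE);
            }
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }

    public static void main(String[] args) throws IOException {
        Properties props = loadConfig(ClasspathConfigSketch.class.getClassLoader());
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```

If the lookup fails, a likely cause is that the classpath entry points at the properties file or at one of the {{org/apache/...}} subdirectories instead of the directory that contains {{org}}.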

You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use 
the UMLS-based components of cTAKES.

I would really appreciate your feedback.
Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this 
work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
