[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser

Chris A. Mattmann (JIRA) Sat, 06 Jun 2015 16:08:08 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575996#comment-14575996
 ]


Chris A. Mattmann commented on TIKA-1645:
-----------------------------------------

I got this working with both tika-app and tika-server. See TIKA-1652 for a 
fix that is needed for this to work. I'm going to go ahead and commit this, 
since I fully documented how to install and test it and it's working well for 
me. It should be fine for 1.9, since it's not enabled by default and you have 
to do quite a bit to get it running. I'd love unit tests at some point, but 
that's not a blocker to getting this great piece of code into 1.9. Thanks for 
the great work, [~gostep] and [~selina]!

> Extraction of biomedical information using CTAKESParser
> -------------------------------------------------------
>
>                 Key: TIKA-1645
>                 URL: https://issues.apache.org/jira/browse/TIKA-1645
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>              Labels: patch
>             Fix For: 1.10
>
>         Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
> TIKA-1645.v02.patch, tika-config.xml
>
>
> As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], 
> [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] 
> is preliminary work to integrate [Apache cTAKES|http://ctakes.apache.org/] 
> into Tika, allowing users to extract biomedical information from clinical 
> text.
> Essentially, this work includes a wrapper for the CAS serializers, which 
> dump the identified annotations into XML-based formats.
> Attached is a new patch that includes CTAKESParser, a new parser that 
> decorates AutoDetectParser and relies on a new version of 
> CTAKESContentHandler, based on feedback from 
> [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser 
> generates the same output as AutoDetectParser and, in addition, metadata 
> containing the clinical annotations identified by cTAKES.
> To run a cTAKES AnalysisEngine through Tika's CTAKESParser, you first need 
> to install the latest stable release of cTAKES (3.2.2), following the 
> instructions in the [User Install 
> Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide].
> Then you can launch Tika as follows:
> {noformat}
> CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
> java -cp \
>   tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig \
>   org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
> {noformat}
> In the example above, {{/path/to/CTAKESConfig}} is the parent directory of 
> the file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which 
> contains the configuration properties used to build the cTAKES 
> AnalysisEngine; {{tika-config.xml}} is a custom Tika configuration file that 
> lists the MIME types CTAKESParser will parse.
> Attached are examples of both {{CTAKESConfig.properties}} and 
> {{tika-config.xml}} for parsing ISA-Tab files with cTAKES.
> You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use 
> the UMLS-based components of cTAKES.
> I would really appreciate your feedback.
> Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this 
> work.
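
The launch command in the quoted description can be wrapped in a small shell 
helper so the classpath is assembled the same way each time. This is only a 
sketch: the function names ctakes_classpath and has_ctakes_config are 
hypothetical and are not part of Tika or cTAKES. The jar name, the cTAKES 
directories, and the properties path are taken from the description above.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not part of Tika or cTAKES.

# Print the classpath used in the quoted command: the tika-app jar, the cTAKES
# desc/ and resources/ directories, the cTAKES lib/* wildcard, and the parent
# directory of CTAKESConfig.properties.
ctakes_classpath() {
    tika_jar=$1
    ctakes_home=$2
    config_dir=$3
    # lib/* stays literal here so that java, not the shell, expands it.
    printf '%s:%s/desc:%s/resources:%s/lib/*:%s\n' \
        "$tika_jar" "$ctakes_home" "$ctakes_home" "$ctakes_home" "$config_dir"
}

# Succeed only if the properties file sits under the config directory at the
# package path named in the description.
has_ctakes_config() {
    [ -f "$1/org/apache/tika/parser/ctakes/CTAKESConfig.properties" ]
}

# Possible usage, with the placeholder paths from the description:
#   CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
#   has_ctakes_config /path/to/CTAKESConfig || echo "config not found" >&2
#   java -cp "$(ctakes_classpath tika-app-1.10-SNAPSHOT.jar \
#       "$CTAKES_HOME" /path/to/CTAKESConfig)" \
#       org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
```

Checking for the properties file before launching is a design choice of this 
sketch, not something the patch requires. It is there to surface a misplaced 
config directory before the JVM starts.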



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
