[jira] [Commented] (TIKA-1645) Extraction of biomedical information using CTAKESParser

Giuseppe Totaro (JIRA) Wed, 03 Jun 2015 23:30:05 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572252#comment-14572252
 ]


Giuseppe Totaro commented on TIKA-1645:
---------------------------------------

Hi [~chrismattmann], thanks for your feedback. I really appreciate it.
I have attached a new patch. It adds the Java class CTAKESParser, which 
decorates the AutoDetectParser and uses the cTAKES Java APIs to extract 
biomedical information from text and, optionally, metadata. All the 
[IdentifiedAnnotation|http://ctakes.apache.org/apidocs/trunk/org/apache/ctakes/typesystem/type/textsem/IdentifiedAnnotation.html]
objects extracted by cTAKES are then added to the file metadata using, by 
default, the prefix {{ctakes:}}.

To build Tika with this patch via Maven, I had to modify 
{{tika-parsers/pom.xml}} and {{tika-bundle/pom.xml}}; otherwise several “cannot 
find symbol” errors are generated at compile time. Specifically, I added the 
{{ctakes-core}} dependency (scope “provided”) to {{tika-parsers/pom.xml}} and 
excluded both the cTAKES and UIMA packages in {{tika-bundle/pom.xml}} by adding 
the following directives to {{<Import-Package>}}:
{noformat}
!org.apache.ctakes.*
!org.apache.uima.*
{noformat}
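
As a rough sketch of how these two changes might look in the POM files (this is not copied from the patch: the {{ctakes-core}} version, the trailing {{*}} entry and the surrounding elements are assumptions for illustration):

```xml
<!-- tika-parsers/pom.xml: sketch only. The version is an assumption,
     taken from the cTAKES release named in the issue description. -->
<dependency>
  <groupId>org.apache.ctakes</groupId>
  <artifactId>ctakes-core</artifactId>
  <version>3.2.2</version>
  <scope>provided</scope>
</dependency>

<!-- tika-bundle/pom.xml, inside the maven-bundle-plugin <instructions>.
     The negated patterns are listed before the catch-all "*" entry so
     that they take effect; the "*" entry itself is an assumption. -->
<Import-Package>
  !org.apache.ctakes.*,
  !org.apache.uima.*,
  *
</Import-Package>
```

With the “provided” scope, the cTAKES classes are available at compile time but are expected to be supplied on the classpath at run time rather than packaged with Tika.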

By the way, I am going to implement another version of CTAKESParser as an 
external parser.
Thanks again,
Giuseppe

> Extraction of biomedical information using CTAKESParser
> -------------------------------------------------------
>
>                 Key: TIKA-1645
>                 URL: https://issues.apache.org/jira/browse/TIKA-1645
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: patch
>             Fix For: 1.10
>
>         Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
> TIKA-1645.v02.patch, tika-config.xml
>
>
> As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], 
> [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] 
> is preliminary work to integrate [Apache 
> cTAKES|http://ctakes.apache.org/] into Tika, allowing users to extract 
> biomedical information from clinical text.
> Essentially, this work includes a wrapper for the CAS serializers, which 
> dump the identified annotations into XML-based formats.
> The attached patch includes CTAKESParser, a new parser that decorates the 
> AutoDetectParser and relies on a new version of CTAKESContentHandler, based 
> on feedback from 
> [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser 
> generates the same output as AutoDetectParser and, in addition, metadata 
> containing the clinical annotations identified by cTAKES.
> To run a cTAKES AnalysisEngine through Tika's CTAKESParser, you first need 
> to install the latest stable release of cTAKES (3.2.2), following the 
> instructions in the [User Install 
> Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide].
> Then you can launch Tika as follows:
> {noformat}
> CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
> java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig \
>   org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
> {noformat}
> In the example above, {{/path/to/CTAKESConfig}} is the parent directory of 
> the file {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}}, which 
> contains the configuration properties used to build the cTAKES 
> AnalysisEngine; {{tika-config.xml}} is a custom Tika configuration file that 
> lists the MIME types that CTAKESParser will parse.
> The attachments include an example of both {{CTAKESConfig.properties}} and 
> {{tika-config.xml}} for parsing ISA-Tab files using cTAKES.
> You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use 
> the UMLS-based components of cTAKES.
> I would really appreciate your feedback.
> Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this 
> work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
