[ 
https://issues.apache.org/jira/browse/TIKA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574545#comment-14574545
 ] 

Chris A. Mattmann commented on TIKA-1645:
-----------------------------------------

OK, I got this working by downloading cTAKES 3.2.2.

{noformat}
bash-3.2$ java -classpath tika-app/target/tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/\*:./config org.apache.tika.cli.TikaCLI --config=tika-config.xml -m /Users/mattmann/Downloads/Vose-2013-American_Journal_of_Hematology.pdf
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/Users/mattmann/tmp/tika1.9/tika-app/target/tika-app-1.10-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/local/apache-ctakes-3.2.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j: reset attribute= "false".
log4j: Threshold ="null".
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [ProgressAppender] additivity to [false].
log4j: Level value for ProgressAppender is  [INFO].
log4j: ProgressAppender level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%m].
log4j: Adding appender named [noEolAppender] to category [ProgressAppender].
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [ProgressDone] additivity to [false].
log4j: Level value for ProgressDone is  [INFO].
log4j: ProgressDone level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%m%n].
log4j: Adding appender named [eolAppender] to category [ProgressDone].
log4j: Level value for root is  [INFO].
log4j: root level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%d{dd MMM yyyy HH:mm:ss} %5p 
%c{1} - %m%n].
log4j: Adding appender named [consoleAppender] to category [root].
04 Jun 2015 23:12:40  INFO ClearNLPDependencyParserAE - using Morphy analysis? 
true
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:
........................................................................................
04 Jun 2015 23:12:50  INFO TokenizerAnnotatorPTB - Initializing 
org.apache.ctakes.core.ae.TokenizerAnnotatorPTB
04 Jun 2015 23:12:50  INFO ContextDependentTokenizerAnnotator - Finite state 
machines loaded.
04 Jun 2015 23:12:50  INFO ConstituencyParser - Initializing parser...
04 Jun 2015 23:12:52  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
04 Jun 2015 23:12:52  INFO StatusContextAnalyzer - initBoundaryData() called 
for ContextInitializer
04 Jun 2015 23:12:52  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
04 Jun 2015 23:12:52  INFO NegationContextAnalyzer - initBoundaryData() called 
for ContextInitializer
04 Jun 2015 23:12:53  INFO SentenceDetector - Sentence detector model file: 
org/apache/ctakes/core/sentdetect/sd-med-model.zip
04 Jun 2015 23:12:55  INFO POSTagger - POS tagger model file: 
org/apache/ctakes/postagger/models/mayo-pos.zip
Loading configuration.
Loading feature templates.
Loading model:
.
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.......
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
..........
Loading model:
.
Loading model:
...
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.....
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
........
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
........
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.....
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
..............
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
....
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
....
Loading model:
.......
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.....
Loading model:
......
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
....
Loading model:
...
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.......
Loading model:
.
Loading model:
.
Loading model:
....
Loading model:
.
Loading model:
.
Loading model:
.......
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:
................................
Loading model:
.............................
04 Jun 2015 23:12:57  INFO Chunker - Chunker model file: 
org/apache/ctakes/chunker/models/chunker-model.zip
04 Jun 2015 23:12:59  INFO JdbcConnectionResourceImpl - Connection established 
to: jdbc:hsqldb:res:org/apache/ctakes/dictionary/lookup/umls2011ab/umls
04 Jun 2015 23:12:59  INFO JdbcConnectionResourceImpl - Connection established 
to: jdbc:hsqldb:res:org/apache/ctakes/dictionary/lookup/rxnorm-hsqldb/umls
04 Jun 2015 23:12:59  INFO JdbcConnectionResourceImpl - Connection established 
to: jdbc:hsqldb:res:org/apache/ctakes/dictionary/lookup/orange_book_hsqldb/umls
04 Jun 2015 23:12:59  INFO UmlsDictionaryLookupAnnotator - Parsing descriptor: 
/usr/local/apache-ctakes-3.2.2/resources/org/apache/ctakes/dictionary/lookup/LookupDesc_Db.xml
04 Jun 2015 23:12:59  INFO FirstTokenPermLookupInitializerImpl - Exclusion 
tagset loaded: [dt, to, rp, ls, pos, md, vbd, vbg, vb, ex, vbp, vbn, pdt, vbz, 
wp, wrb, in, wps, pp$, prp$, wdt, prp, pp, cc, cd]
04 Jun 2015 23:12:59  INFO FirstTokenPermLookupInitializerImpl - Exclusion 
tagset loaded: [to, dt, rp, ex, vbp, ls, vbn, pdt, wp, vbz, wrb, in, pos, wps, 
md, wdt, pp$, vbd, vb, vbg, pp, cc, cd]
04 Jun 2015 23:12:59  INFO UmlsDictionaryLookupAnnotator - Using 
ctakes.umlsaddr: https://uts-ws.nlm.nih.gov/restful/isValidUMLSUser: 
chrismattmann
04 Jun 2015 23:13:00  INFO LvgCmdApiResourceImpl - Loading NLM Norm and Lvg 
with config file = 
/usr/local/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/data/config/lvg.properties
04 Jun 2015 23:13:00  INFO LvgCmdApiResourceImpl -   config file absolute path 
= 
/usr/local/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/data/config/lvg.properties
04 Jun 2015 23:13:00  INFO LvgCmdApiResourceImpl - cwd = 
/Users/mattmann/tmp/tika1.9
04 Jun 2015 23:13:00  INFO LvgCmdApiResourceImpl - cd 
/usr/local/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/
04 Jun 2015 23:13:01  INFO LvgCmdApiResourceImpl - cd 
/Users/mattmann/tmp/tika1.9
04 Jun 2015 23:13:01  INFO SentenceDetector - Starting processing.
04 Jun 2015 23:13:01  INFO TokenizerAnnotatorPTB - process(JCas) in 
org.apache.ctakes.core.ae.TokenizerAnnotatorPTB
04 Jun 2015 23:13:01  INFO LvgAnnotator - process(JCas)
04 Jun 2015 23:13:05  INFO ContextDependentTokenizerAnnotator - process(JCas)
04 Jun 2015 23:13:07  INFO POSTagger - process(JCas)
04 Jun 2015 23:13:08  INFO Chunker -  process(JCas)
04 Jun 2015 23:13:09  INFO ChunkAdjuster -  process(JCas)
04 Jun 2015 23:13:09  INFO ChunkAdjuster -  process(JCas)
04 Jun 2015 23:13:09  INFO CopyAnnotator - process(JCas)
04 Jun 2015 23:13:09  INFO OverlapAnnotator - process(JCas)
04 Jun 2015 23:13:09  INFO UmlsDictionaryLookupAnnotator - process(JCas)
04 Jun 2015 23:13:54  INFO MaxentParserWrapper - Started processing: null
Couldn't find parse for: Claim your Certificate.This activity will be 
available for CME credit for twelvemonths following its launch date.
Couldn't find parse for: Asymptomatic elderly or low-MIPI patients can be 
observed without any therapy.When the patients become symptomatic, first line 
therapychoices include R-CHOP (6 rituximab maintenance), R-Bendamustine, or a 
clinical trial.Initial management of a young symptomatic patientSeveral studies 
have suggested that aggressive thera-pies in younger patients with MCL may 
improve the out-comes.
04 Jun 2015 23:14:30  INFO MaxentParserWrapper - Done parsing: null
Content-Length: 457115
Content-Type: application/pdf
Creation-Date: 2013-11-20T13:24:11Z
Last-Modified: 2013-11-22T14:13:25Z
Last-Save-Date: 2013-11-22T14:13:25Z
WPS-ARTICLEDOI: 10.1002/ajh.23615
WPS-JOURNALDOI: 10.1002/(ISSN)1096-8652
WPS-PROCLEVEL: 2
X-Parsed-By: org.apache.tika.parser.CompositeParser
X-Parsed-By: org.apache.tika.parser.ctakes.CTAKESParser
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.pdf.PDFParser
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
created: Wed Nov 20 05:24:11 PST 2013
ctakes:AnatomicalSiteMention: Cell:189:193:C0007634,C1269647
ctakes:AnatomicalSiteMention: Media:432:437:C0162867
ctakes:AnatomicalSiteMention: Media:593:598:C0162867
..
{noformat}

I also made it work for PDF by adding application/pdf to the tika-config.xml. 
Great work, Giuseppe. With a unit test, this would be perfect to commit!
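
For anyone trying to reproduce the PDF change, here is a rough sketch of what such a tika-config.xml could look like. The attached file is not reproduced in this message, so the contents below are an assumption: they follow Tika's usual properties/parsers/parser layout and take the parser class name from the X-Parsed-By line in the output above. Treat it as a starting point, not as the attached configuration.

```shell
# Write a minimal tika-config.xml into the current directory.
# NOTE: hypothetical contents; the file attached to the issue may differ.
cat > tika-config.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumed layout: hand application/pdf documents to the CTAKESParser -->
    <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
EOF

# Print the line that registers the PDF media type, as a quick sanity check.
grep '<mime>application/pdf</mime>' tika-config.xml
```

The resulting file would then be passed with --config=tika-config.xml, as in the command at the top of this comment.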


> Extraction of biomedical information using CTAKESParser
> -------------------------------------------------------
>
>                 Key: TIKA-1645
>                 URL: https://issues.apache.org/jira/browse/TIKA-1645
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: patch
>             Fix For: 1.10
>
>         Attachments: CTAKESConfig.properties, TIKA-1645.patch, 
> TIKA-1645.v02.patch, tika-config.xml
>
>
> As mentioned in [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642], 
> [CTAKESContentHandler|https://github.com/giuseppetotaro/CTAKESContentHadler] 
> is preliminary work to integrate 
> [Apache cTAKES|http://ctakes.apache.org/] into Tika, allowing users to 
> extract biomedical information from clinical text.
> Essentially, this work includes a wrapper for CAS serializers that dump 
> the identified annotations into XML-based formats.
> Attached is a new patch that includes the CTAKESParser, a new 
> parser that decorates the AutoDetectParser and relies on a new version of 
> CTAKESContentHandler, based on feedback from 
> [TIKA-1642|https://issues.apache.org/jira/browse/TIKA-1642]. This parser 
> generates the same output as AutoDetectParser and, in addition, metadata 
> containing the clinical annotations identified by cTAKES.
> To run a cTAKES AnalysisEngine through Tika's CTAKESParser, you first need 
> to install the latest stable release of cTAKES (3.2.2), following the 
> instructions in the [User Install 
> Guide|https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+User+Install+Guide].
> Then, you can launch Tika as follows:
> {noformat}
> CTAKES_HOME=/usr/local/apache-ctakes-3.2.2
> java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input
> {noformat}
> In the example above, {{/path/to/CTAKESConfig}} is the base directory under 
> which {{org/apache/tika/parser/ctakes/CTAKESConfig.properties}} is located; 
> that file holds the configuration properties used to build the cTAKES 
> AnalysisEngine. {{tika-config.xml}} is a custom Tika configuration file that 
> lists the MIME types that CTAKESParser will parse.
> Examples of both {{CTAKESConfig.properties}} and {{tika-config.xml}}, set up 
> to parse ISA-Tab files using cTAKES, are attached.
> You need [UMLS credentials|https://uts.nlm.nih.gov/home.html] in order to use 
> the UMLS-based components of cTAKES.
> I would really appreciate your feedback.
> Thanks [~selina], [~chrismattmann] and [~lewismc] for supporting me on this 
> work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
