Alan Simmons created TIKA-2188:
----------------------------------

             Summary: Illegal SAXException when using cTAKESParser
                 Key: TIKA-2188
                 URL: https://issues.apache.org/jira/browse/TIKA-2188
             Project: Tika
          Issue Type: Bug
          Components: cli, parser
    Affects Versions: 1.13
         Environment: Ubuntu 14.04.5 LTS
            Reporter: Alan Simmons


Contents:
1. Description of problem
2. My tika-config.xml file
3. My CTAKESConfig.properties file
4. Error stack of problem

DESCRIPTION OF PROBLEM:
I am trying to configure Tika to use cTAKES as a parser, per instructions in 
https://wiki.apache.org/tika/cTAKESParser.

I am working on a Mac running Sierra (OSX 10.12.1). 

I was able to configure Tika 1.13 to run with cTAKES 3.2.2 as a parser in my 
OSX environment. In particular, I was able to run both the standalone app and 
server against the sample file (Vose...pdf) mentioned in the Wiki.

I then tried to configure Tika 1.15 (the version from the GitHub repo) in a 
Docker container. The OS in the container is Ubuntu 14.04.5.

I tried to run the Tika standalone app jar against the Vose PDF. It failed with 
the stack trace that I include at the bottom of this message.

I then tried to run the Tika 1.13 app in the Docker container. Same problem. 
In the Docker container:
1. I am able to run the Tika 1.15 app with the Default parser (i.e., without 
referring to the custom configuration XML for cTAKES).
2. I am able to run the Tika 1.15 app if the configuration file uses the 
Default parser before the org.apache.tika.parser.ctakes.CTAKESParser.
3. I am able to run cTAKES directly from the CLI against the Vose PDF, so I 
know that cTAKES can parse the file.
4. I ran pdfbox-app-2.0.3.jar against the file with no errors.


---------------------
MY tika-config.xml file:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
      <mime>application/x-isatab</mime>
      <mime>application/pdf</mime>
      <mime>text/plain</mime>
    </parser>
  </parsers>
</properties>
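
To double-check which MIME types this file routes to CTAKESParser, the config can be read back with the JDK's own XML parser. This is only a minimal sketch: the class and method names (ConfigInspector, parserMimes) are hypothetical, and it does not reproduce how Tika itself loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
class ConfigInspector {
    /** Maps each <parser class="..."> to the <mime> values listed under it. */
    static Map<String, List<String>> parserMimes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            List<String> mimes = new ArrayList<>();
            NodeList mimeNodes = parser.getElementsByTagName("mime");
            for (int j = 0; j < mimeNodes.getLength(); j++) {
                mimes.add(mimeNodes.item(j).getTextContent().trim());
            }
            result.put(parser.getAttribute("class"), mimes);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // The configuration from this report, inlined for the example.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n"
                + "<properties>\n"
                + "  <parsers>\n"
                + "    <parser class=\"org.apache.tika.parser.ctakes.CTAKESParser\">\n"
                + "      <mime>application/x-isatab</mime>\n"
                + "      <mime>application/pdf</mime>\n"
                + "      <mime>text/plain</mime>\n"
                + "    </parser>\n"
                + "  </parsers>\n"
                + "</properties>\n";
        System.out.println(parserMimes(xml));
    }
}
```

With the file above, the only parser entry is CTAKESParser, which claims application/pdf itself; no other parser is declared for that type.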

-----
MY CTAKESConfig.properties file:

aeDescriptorPath=/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
text=true
annotationProps=BEGIN,END,ONTOLOGY_CONCEPT_ARR
separatorChar=:
metadata=Problem,oncologic history,medical history,Study Title, Study Description
UMLSUser=<my UMLS user name>
UMLSPass=<my UMLS password>
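
One thing worth ruling out with a file like this is a value that got broken across two physical lines. Assuming the file is read with java.util.Properties (an assumption about how CTAKESConfig loads it, not something confirmed in this report), a value must stay on one line unless the line ends with a backslash. A small sketch, with a hypothetical class name and a shortened metadata value:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo class; only java.util.Properties behavior is shown.
class WrappedPropertyDemo {
    static Properties load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Value broken across two physical lines with no continuation marker.
        Properties wrapped = load("metadata=Study Title,Study\nDescription\n");
        // Same value, continued with a trailing backslash.
        Properties continued = load("metadata=Study Title,Study \\\nDescription\n");

        // The wrapped value stops at the line break...
        System.out.println("[" + wrapped.getProperty("metadata") + "]");
        // ...and the second line becomes its own key with an empty value.
        System.out.println(wrapped.containsKey("Description"));
        // The backslash-continued value is read as one string.
        System.out.println("[" + continued.getProperty("metadata") + "]");
    }
}
```

If the line break exists only in this email and not in the file on disk, none of this applies.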

---
ERROR STACK TRACE

NOTE: By comparing the info messages produced in different scenarios 
(successful Tika+cTAKES, cTAKES CPE, and unsuccessful Tika+cTAKES), it looks 
like Tika is loading the cTAKES parser but failing right after the POS tagger 
model is loaded.

java -Xms256m -Xmx1024m -classpath $HOME/src/ctakes-config:/tika/tika-app/target/tika-app-1.15-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/\* org.apache.tika.cli.TikaCLI --config=$HOME/src/ctakes-config/tika-config.xml -m Vose.pdf
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tika/tika-app/target/tika-app-1.15-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/ctakes_files/apache-ctakes-3.2.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j: reset attribute= "false".
log4j: Threshold ="null".
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [ProgressAppender] additivity to [false].
log4j: Level value for ProgressAppender is  [INFO].
log4j: ProgressAppender level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%m].
log4j: Adding appender named [noEolAppender] to category [ProgressAppender].
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [ProgressDone] additivity to [false].
log4j: Level value for ProgressDone is  [INFO].
log4j: ProgressDone level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%m%n].
log4j: Adding appender named [eolAppender] to category [ProgressDone].
log4j: Level value for root is  [INFO].
log4j: root level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%d{dd MMM yyyy HH:mm:ss} %5p %c{1} - %m%n].
log4j: Adding appender named [consoleAppender] to category [root].
01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl - Loading NLM Norm and Lvg with config file = /ctakes_files/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/data/config/lvg.properties
01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl -   config file absolute path = /ctakes_files/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/data/config/lvg.properties
01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl - cwd = /
01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl - cd /ctakes_files/apache-ctakes-3.2.2/resources/org/apache/ctakes/lvg/
01 Dec 2016 20:38:45  INFO LvgCmdApiResourceImpl - cd /
01 Dec 2016 20:38:45  INFO ClearNLPDependencyParserAE - using Morphy analysis? true
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:
........................................................................................
01 Dec 2016 20:39:01  INFO Chunker - Chunker model file: org/apache/ctakes/chunker/models/chunker-model.zip
01 Dec 2016 20:39:02  INFO ContextDependentTokenizerAnnotator - Finite state machines loaded.
01 Dec 2016 20:39:02  INFO ConstituencyParser - Initializing parser...
01 Dec 2016 20:39:07  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
01 Dec 2016 20:39:07  INFO NegationContextAnalyzer - initBoundaryData() called for ContextInitializer
01 Dec 2016 20:39:08  INFO POSTagger - POS tagger model file: org/apache/ctakes/postagger/models/mayo-pos.zip
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-237: Illegal SAXException from org.apache.tika.parser.ParserDecorator$1@5fe1ce85
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:290)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: org.xml.sax.SAXException
        at org.apache.tika.parser.ctakes.CTAKESContentHandler.endDocument(CTAKESContentHandler.java:162)
        at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
        at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
        at org.apache.tika.sax.SafeContentHandler.endDocument(SafeContentHandler.java:281)
        at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:230)
        at org.apache.tika.parser.EmptyParser.parse(EmptyParser.java:55)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.ctakes.CTAKESParser.parse(CTAKESParser.java:85)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 5 more
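
The trace ends at a SAXException with no message and no nested cause, so whatever failed underneath CTAKESContentHandler.endDocument is not visible here. As a general illustration of how that can happen with the JDK's org.xml.sax.SAXException (and not a claim about what CTAKESContentHandler actually does at line 162), compare building the exception from a message alone with wrapping the original exception. The class and method names below are hypothetical.

```java
import org.xml.sax.SAXException;

// Hypothetical demo class; uses only the JDK's SAXException.
class SaxCauseDemo {
    /** Follows getCause() links down to the innermost throwable. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        Exception underlying = new IllegalStateException("pipeline failed");

        // Only the message is copied: the original exception is dropped.
        SAXException messageOnly = new SAXException(underlying.getMessage());
        // The exception itself is wrapped: it stays reachable via getCause().
        SAXException wrapped = new SAXException(underlying);

        System.out.println(rootCause(messageOnly).getClass().getName());
        System.out.println(rootCause(wrapped).getClass().getName());
    }
}
```

In the first case a stack trace would stop at the SAXException, as it does above; in the second it would continue with a further "Caused by:" entry naming the underlying exception.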



