[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904156#comment-14904156
 ] 

Nick Burch commented on TIKA-1739:
----------------------------------

My view is that {{AutoDetectParser}} is a special kind of parser decorator too. 
It doesn't do any parsing itself; it decorates a set of other parsers by first 
doing detection, then handing the type to those parsers to be processed. Because 
of this, {{AutoDetectParser}} has a few other restrictions: it can't be set in 
the Tika Config file, you have to explicitly ask for it, and it must be the 
outermost decorator.

As I understand it, what the cTAKES decorator does is enhance the output of 
other parsers with medical-related information, either for all parsers or just 
for some types and parsers. As such, I believe it needs to go outside of 
{{DefaultParser}} or a collection of explicit parsers wrapped as a 
{{CompositeParser}}. It waits for those real parser(s) to run, then 
enhances/decorates their output. It needs to be inside the {{AutoDetectParser}} 
"decoration", as in many cases it needs to wait for the type to be found before 
it can work out whether it applies or not.

The key thing to remember: {{AutoDetectParser}} is not a parser! It's a 
decorator on a set of other parsers, which finds the type first. You might give 
your file to {{AutoDetectParser}}, but that isn't actually what does the work.
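
To make the nesting concrete, here is a minimal, self-contained sketch of the 
layering described above. Nothing in it is Tika code: {{SketchParser}}, 
{{PlainTextParser}}, {{CompositeSketch}}, {{EnhancingDecorator}} and 
{{DetectingDecorator}} are hypothetical stand-ins for a real parser, a 
{{CompositeParser}}, the cTAKES-style decorator and {{AutoDetectParser}} 
respectively, and the detection rule is a toy. It only illustrates the order 
in which the layers run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Self-contained sketch; every type here is hypothetical and none is a Tika class.
class DecoratorSketch {

    /** Hypothetical stand-in for a parser that is handed an already-detected type. */
    interface SketchParser {
        void parse(String content, String type,
                   List<String> parsedBy, Map<String, String> metadata);
    }

    /** A "real" parser: the only layer that extracts anything from the content. */
    static final class PlainTextParser implements SketchParser {
        @Override
        public void parse(String content, String type,
                          List<String> parsedBy, Map<String, String> metadata) {
            parsedBy.add("PlainTextParser");
            metadata.put("text", content.trim());
        }
    }

    /** Routes to a real parser by type, in the role of a composite of parsers. */
    static final class CompositeSketch implements SketchParser {
        private final Map<String, SketchParser> byType;

        CompositeSketch(Map<String, SketchParser> byType) {
            this.byType = byType;
        }

        @Override
        public void parse(String content, String type,
                          List<String> parsedBy, Map<String, String> metadata) {
            parsedBy.add("CompositeSketch");
            SketchParser real = byType.get(type);
            if (real != null) {
                real.parse(content, type, parsedBy, metadata);
            }
        }
    }

    /** Enhancing decorator: lets the wrapped parsers run, then adds to their output. */
    static final class EnhancingDecorator implements SketchParser {
        private final SketchParser inner;
        private final String appliesTo;

        EnhancingDecorator(SketchParser inner, String appliesTo) {
            this.inner = inner;
            this.appliesTo = appliesTo;
        }

        @Override
        public void parse(String content, String type,
                          List<String> parsedBy, Map<String, String> metadata) {
            inner.parse(content, type, parsedBy, metadata); // real parsers first
            parsedBy.add("EnhancingDecorator");
            // The type is already known here, so the decorator can decide if it applies.
            if (appliesTo.equals(type) && metadata.containsKey("text")) {
                metadata.put("enhanced:length",
                        String.valueOf(metadata.get("text").length()));
            }
        }
    }

    /** Outermost layer: detects the type, then hands it to everything inside. */
    static final class DetectingDecorator {
        private final SketchParser inner;

        DetectingDecorator(SketchParser inner) {
            this.inner = inner;
        }

        Map<String, String> parse(String content, List<String> parsedBy) {
            // Toy detection rule, purely for illustration.
            String type = content.trim().startsWith("<") ? "text/html" : "text/plain";
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("Content-Type", type);
            inner.parse(content, type, parsedBy, metadata);
            return metadata;
        }
    }

    /** Wires the layers: detection outside, enhancement in the middle, parsers inside. */
    static DetectingDecorator build() {
        Map<String, SketchParser> byType = new LinkedHashMap<>();
        byType.put("text/plain", new PlainTextParser());
        return new DetectingDecorator(
                new EnhancingDecorator(new CompositeSketch(byType), "text/plain"));
    }

    public static void main(String[] args) {
        List<String> parsedBy = new ArrayList<>();
        Map<String, String> metadata = build().parse("  some clinical note  ", parsedBy);
        System.out.println(parsedBy);
        System.out.println(metadata);
    }
}
```

In this sketch the enhancing layer only sees a type because the detecting layer 
sits outside it, and it only has output to enhance because the composite sits 
inside it. Swapping either pair would leave it with nothing to work on.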

> cTAKESParser doesn't work in 1.11
> ---------------------------------
>
>                 Key: TIKA-1739
>                 URL: https://issues.apache.org/jira/browse/TIKA-1739
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.11
>
>         Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" http://localhost:9999/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision&revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



