[
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903389#comment-14903389
]
Nick Burch commented on TIKA-1739:
----------------------------------
We explicitly don't let you set an {{AutoDetectParser}} in the config. It's
something you have to choose to use, giving it the parser(s) you want used
post-detection.

In the non-cTAKES case, you get a Composite Parser that'll handle your formats
(directly/explicitly/via Tika Config xml/via default Tika Config), then give
that (perhaps implicitly) to {{AutoDetectParser}}. {{AutoDetectParser}}
identifies the type of the document, then picks the right parser based on the
type.
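
The detect-then-dispatch flow described above can be sketched with a small toy model. This is only an illustration: {{ToyParser}}, {{ToyCompositeParser}} and {{ToyAutoDetectParser}} are hypothetical stand-ins invented for this example, not Tika's actual classes, and the detection logic is deliberately naive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: these types are hypothetical stand-ins, not Tika's classes.
class DetectThenDispatchSketch {

    /** Stand-in for a parser: turns a document of a known type into text. */
    interface ToyParser {
        String parse(String document, String mimeType);
    }

    /** Stand-in for a composite parser: picks a child parser by type. */
    static class ToyCompositeParser implements ToyParser {
        private final Map<String, ToyParser> byType = new LinkedHashMap<>();

        ToyCompositeParser register(String mimeType, ToyParser parser) {
            byType.put(mimeType, parser);
            return this;
        }

        @Override
        public String parse(String document, String mimeType) {
            ToyParser parser = byType.get(mimeType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + mimeType);
            }
            return parser.parse(document, mimeType);
        }
    }

    /** Stand-in for the auto-detecting wrapper: detect first, then delegate. */
    static class ToyAutoDetectParser {
        private final ToyParser delegate;

        ToyAutoDetectParser(ToyParser delegate) {
            this.delegate = delegate;
        }

        // Deliberately naive detection, just enough to drive the example.
        static String detect(String document) {
            return document.trim().startsWith("<") ? "text/html" : "text/plain";
        }

        String parse(String document) {
            return delegate.parse(document, detect(document));
        }
    }

    /** Builds the chain: the composite is handed to the detecting wrapper. */
    static ToyAutoDetectParser build() {
        ToyCompositeParser composite = new ToyCompositeParser()
                .register("text/plain", (doc, type) -> "plain:" + doc)
                .register("text/html",
                        (doc, type) -> "html:" + doc.replaceAll("<[^>]*>", ""));
        return new ToyAutoDetectParser(composite);
    }

    public static void main(String[] args) {
        ToyAutoDetectParser parser = build();
        System.out.println(parser.parse("hello"));
        System.out.println(parser.parse("<p>hello</p>"));
    }
}
```

The point of the split is that the wrapper only decides the type, and the composite only decides which parser handles that type.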

In the cTAKES case, you get your chosen Composite Parser again, and give that
to cTAKES (possibly via Tika Config xml, e.g. in the case above). You now create
an {{AutoDetectParser}} as before, and give it cTAKES. {{AutoDetectParser}}
identifies the type, then gives the document *with the type* to cTAKES, as
cTAKES claims all the mime types. cTAKES then uses its child Composite Parser
to have the real parsing done, based on the type that {{AutoDetectParser}}
supplied to it. When that's done, cTAKES then decorates the output.

Or, if you know the type yourself, you give that to cTAKES, which gives it to
the child Composite Parser for parsing, then decorates the result, with no
{{AutoDetectParser}} needed.
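
As a rough illustration of the two entry paths just described, here is a toy model of a decorating parser that claims every type, delegates the real parsing to a child, and then adds its own output. All names here ({{ToyDecorator}}, {{ToyTextParser}}, {{parseWithDetection}}, {{parseKnownType}}) are hypothetical and invented for the example, and the "annotation" is a simple word count standing in for whatever the real decorator would add. It does not use or reproduce the real Tika or cTAKES APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical stand-ins, not the real Tika or cTAKES classes.
class DecoratingParserSketch {

    /** Result of a parse: extracted text plus the chain of parsers that ran. */
    static class Result {
        final List<String> parsedBy = new ArrayList<>();
        String text = "";
        String annotation; // stays null unless the decorator runs
    }

    interface ToyParser {
        boolean supports(String mimeType);
        void parse(String document, String mimeType, Result result);
    }

    /** Child parser: does the real work for the types it knows. */
    static class ToyTextParser implements ToyParser {
        public boolean supports(String mimeType) {
            return "text/plain".equals(mimeType);
        }

        public void parse(String document, String mimeType, Result result) {
            result.parsedBy.add("ToyTextParser");
            result.text = document;
        }
    }

    /** Decorator: claims every type, delegates parsing, then adds its output. */
    static class ToyDecorator implements ToyParser {
        private final ToyParser child;

        ToyDecorator(ToyParser child) {
            this.child = child;
        }

        public boolean supports(String mimeType) {
            return true; // claims all types so the wrapper always routes here
        }

        public void parse(String document, String mimeType, Result result) {
            result.parsedBy.add("ToyDecorator");
            // Real parsing happens in the child, using the type we were given.
            child.parse(document, mimeType, result);
            // Decoration step: a word count stands in for real annotations.
            result.annotation = "words=" + result.text.trim().split("\\s+").length;
        }
    }

    /** With detection: the wrapper works out the type and passes it along. */
    static Result parseWithDetection(ToyParser parser, String document) {
        String detected = "text/plain"; // placeholder for a real detection step
        if (!parser.supports(detected)) {
            throw new IllegalArgumentException("No parser for " + detected);
        }
        Result result = new Result();
        result.parsedBy.add("ToyAutoDetect");
        parser.parse(document, detected, result);
        return result;
    }

    /** Without detection: the caller already knows the type. */
    static Result parseKnownType(ToyParser parser, String document, String mimeType) {
        Result result = new Result();
        parser.parse(document, mimeType, result);
        return result;
    }

    public static void main(String[] args) {
        ToyParser decorated = new ToyDecorator(new ToyTextParser());
        Result viaDetect = parseWithDetection(decorated, "one two three");
        System.out.println(viaDetect.parsedBy + " " + viaDetect.annotation);
        Result direct = parseKnownType(decorated, "one two", "text/plain");
        System.out.println(direct.parsedBy + " " + direct.annotation);
    }
}
```

In this sketch the recorded parser chain shows which components actually ran, which is the same kind of evidence the X-Parsed-By list gives in the report below.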
> cTAKESParser doesn't work in 1.11
> ---------------------------------
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
> Issue Type: Bug
> Components: parser, server
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" http://localhost:9999/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision&revision=1684199
> [~gostep] can you help me look here?
> I'm working on
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is
> where I first saw this.