On Sun, 7 Jun 2015, Mattmann, Chris A (3980) wrote:
> Great question Nick. If you have a better idea on how to make it so that
> any file can come into the cTAKES parser, get its text and metadata
> parsed out, and then feed that into cTAKES I’m all ears. We just thought
> that decorating AutoDetect would serve that purpose for us. Since cTAKES
> just puts metadata in the met object (as of now) and doesn’t do XHTML
> content (future improvement), I supposed we could instantiate an
> AutoDetectParser instead of decorating it which may help. Dunno, anyways
> looking forward to what your thoughts are :-)

I've had a go at this, and fixed a few Tika bugs on the way... You can now
(as detailed in the javadoc) just do:
    AutoDetectParser parser = new AutoDetectParser(new CTAKESParser());
And you'll get auto-detection with cTAKES applied to the result.
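
For anyone who wants to see the shape of the idea without pulling in Tika and cTAKES, here is a rough, self-contained sketch of the decoration pattern. Everything in it is a stand-in invented for illustration: the `Parser` interface, `DetectingParser`, `AnnotatingParser` and the `word-count` key are hypothetical and are not Tika's real classes or metadata names. It only shows an outer parser running a wrapped one and then adding metadata derived from the extracted text.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class DecoratorSketch {

    /** Hypothetical stand-in for a parser: returns text and fills a metadata map. */
    interface Parser {
        String parse(byte[] input, Map<String, String> metadata);
    }

    /** Hypothetical stand-in for auto-detection: sniffs the content and extracts text. */
    static class DetectingParser implements Parser {
        @Override
        public String parse(byte[] input, Map<String, String> metadata) {
            String raw = new String(input, StandardCharsets.UTF_8);
            boolean looksLikeXml = raw.startsWith("<");
            metadata.put("Content-Type", looksLikeXml ? "application/xml" : "text/plain");
            // For "XML", strip the tags and keep the character content only.
            return looksLikeXml ? raw.replaceAll("<[^>]*>", " ").trim() : raw;
        }
    }

    /** Hypothetical stand-in for the annotating decorator around another parser. */
    static class AnnotatingParser implements Parser {
        private final Parser wrapped;

        AnnotatingParser(Parser wrapped) {
            this.wrapped = wrapped;
        }

        @Override
        public String parse(byte[] input, Map<String, String> metadata) {
            // Let the wrapped parser do detection and text extraction first.
            String text = wrapped.parse(input, metadata);
            // Then add metadata derived from the extracted text (a toy "annotation").
            int words = text.isEmpty() ? 0 : text.split("\\s+").length;
            metadata.put("word-count", Integer.toString(words));
            return text;
        }
    }

    public static void main(String[] args) {
        Parser parser = new AnnotatingParser(new DetectingParser());
        Map<String, String> metadata = new LinkedHashMap<>();
        String text = parser.parse(
                "<doc>chest pain</doc>".getBytes(StandardCharsets.UTF_8), metadata);
        System.out.println(text);
        System.out.println(metadata);
    }
}
```

The toy decorator only touches the metadata map and passes the text through unchanged, which mirrors the point in the quoted mail that cTAKES currently contributes metadata rather than XHTML content.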
Alternatively, if you want to turn on cTAKES support in config, e.g. for use
with the Tika CLI or Tika Server, you just need a config file like:
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
          <parser class="org.apache.tika.parser.DefaultParser"/>
        </parser>
      </parsers>
    </properties>
(Example config file in SVN!)
Does this work for everyone?
Nick