Great question, Nick. If you have a better idea for how to let any file come into the cTAKES parser, have its text and metadata parsed out, and then feed that into cTAKES, I’m all ears. We just thought that decorating AutoDetectParser would serve that purpose for us. Since cTAKES only puts metadata in the met object (as of now) and doesn’t produce XHTML content (a future improvement), I suppose we could instantiate an AutoDetectParser instead of decorating it, which may help. Dunno; anyway, looking forward to your thoughts :-)
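To make the trade-off concrete, here is a rough, self-contained sketch of the two wirings. It does not use the Tika API at all: SketchParser, PlainText, AutoDetect, Annotator, the "content" and "tokenCount" keys, and the MAX_DEPTH guard are all hypothetical stand-ins invented for illustration, and the token count is only a placeholder for whatever cTAKES would actually compute. The idea is just that a wrapper around the auto-detecting parser works when it sits outside the registered parser list, and re-enters itself when it is also registered in that list (the SPI case from the quoted thread).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DecoratorSketch {

    /** Hypothetical stand-in for a parser: reads input, fills a metadata map. */
    interface SketchParser {
        void parse(String input, Map<String, String> met);
    }

    /** Stand-in for an ordinary format parser that extracts text. */
    static final class PlainText implements SketchParser {
        public void parse(String input, Map<String, String> met) {
            met.put("content", input.trim());
        }
    }

    /** Stand-in for auto-detection: hands the input to every registered parser. */
    static final class AutoDetect implements SketchParser {
        private final List<SketchParser> registered;

        AutoDetect(List<SketchParser> registered) {
            this.registered = registered;
        }

        public void parse(String input, Map<String, String> met) {
            for (SketchParser p : registered) {
                p.parse(input, met);
            }
        }
    }

    /** Stand-in for the cTAKES step: delegate first, then add metadata only. */
    static final class Annotator implements SketchParser {
        private static final int MAX_DEPTH = 50; // assumed guard, illustration only
        private final SketchParser delegate;
        private int depth = 0;

        Annotator(SketchParser delegate) {
            this.delegate = delegate;
        }

        public void parse(String input, Map<String, String> met) {
            if (depth >= MAX_DEPTH) {
                throw new IllegalStateException("parser re-entered itself");
            }
            depth++;
            try {
                delegate.parse(input, met);
                String content = met.getOrDefault("content", "");
                int tokens = content.isEmpty() ? 0 : content.split("\\s+").length;
                met.put("tokenCount", String.valueOf(tokens));
            } finally {
                depth--;
            }
        }
    }

    /** Wrapper outside auto-detection and NOT in the registered list. */
    static Map<String, String> parseOutside(String input) {
        List<SketchParser> registered = new ArrayList<>();
        registered.add(new PlainText());
        SketchParser parser = new Annotator(new AutoDetect(registered));
        Map<String, String> met = new HashMap<>();
        parser.parse(input, met);
        return met;
    }

    /** Same wrapper, but also registered in the list it delegates to. */
    static boolean recursesWhenRegistered(String input) {
        List<SketchParser> registered = new ArrayList<>();
        registered.add(new PlainText());
        Annotator annotator = new Annotator(new AutoDetect(registered));
        registered.add(annotator); // mimics enabling the wrapper via SPI
        try {
            annotator.parse(input, new HashMap<>());
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> met = parseOutside("  patient denies chest pain ");
        System.out.println(met.get("content"));
        System.out.println(met.get("tokenCount"));
        System.out.println(recursesWhenRegistered("x"));
    }
}
```

In this toy version the fix is simply to keep the wrapper out of the list that its own delegate dispatches to; whether that is better done by holding a private AutoDetectParser or by sitting between AutoDetectParser and DefaultParser in the config is exactly the open question.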
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Nick Burch <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, June 7, 2015 at 5:01 AM
To: "[email protected]" <[email protected]>
Subject: Re: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

>On Sun, 7 Jun 2015, Mattmann, Chris A (3980) wrote:
>> Also the lovely thing here too is that since cTAKESParser is a decorator
>> for AutoDetectParser there is magical infinite recursion if it’s enabled
>> via SPI.
>
>Should it really be a wrapper for AutoDetectParser though? I haven't read
>through the wiki page or the code yet (need to do that after lunch...),
>but my general guess would've been that a wrapping parser should sit
>between AutoDetectParser and DefaultParser? (AutoDetectParser normally
>calls to DefaultParser via the Tika config).
>
>If it worked that way, we could slip it in between the two in the tika
>config file.
>
>Though if someone could quickly point out why it needs to wrap outside
>AutoDetectParser rather than inside, that'd save time!
>
>Nick
