All,

I have a proof-of-concept (not anywhere near ready for committing) modified/plagiarized version of the ForkParser available here:
https://github.com/tballison/tika-addons/tree/1.18/recursive-fork-parser/src/main/java/org/tallison/tika/parser/forkrecursive

First, the ForkParser is simply genius. Every time I dig into it, I feel like I'm looking at advanced alien technology.

I see three drawbacks to our current ForkParser:

1) It requires that the full Tika parser be on the client's class path; it then sends that parser and the inputstream to a separate process for the actual processing. I think we're lucky just to be able to build tika-app without too many jar conflicts.
2) Related, it requires that all of our dependencies be serializable.
3) I don't see an easy way to incorporate the RecursiveParserWrapper, partly because of my mistakes in implementing it!

My current alternative moves most of Tika to the child process, so the client only needs tika-core and tika-serialization. The client specifies a directory where tika-app and optional dependencies live, and the child process builds a Parser from that.

The current alternative uses the RecursiveParserWrapper as the (hard coded) default, but I think we could fairly easily make this configurable via tika-config.xml (ParserFactory).

The current alternative uses a TextContentHandler, not xhtml...again, I _think_ we could make this configurable via tika-config (ContentHandlerFactory).

My current proof of concept is strictly file based...should be easy enough to fix.

Anyhow, any and all feedback is welcomed.

Cheers,

Tim
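
For illustration only, here is a rough sketch of the "child process builds a Parser from a directory" idea described above, using nothing but the JDK. It is not taken from the linked proof of concept: the class name JarDirLoaderSketch and its methods are hypothetical, and it assumes the child process already has tika-core on its own class path so that the jars in the directory can be layered on top of it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; not part of Tika or of the linked proof of concept.
class JarDirLoaderSketch {

    /** Collects a URL for every *.jar directly inside dir, sorted by name. */
    static List<URL> jarUrls(Path dir) {
        List<URL> urls = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Directory listing order is unspecified, so sort for repeatable class path order.
        urls.sort(Comparator.comparing(URL::toString));
        return urls;
    }

    /** Builds a class loader over the jars in dir, delegating to this class's loader first. */
    static URLClassLoader loaderFor(Path dir) {
        return new URLClassLoader(jarUrls(dir).toArray(new URL[0]),
                JarDirLoaderSketch.class.getClassLoader());
    }

    /** Instantiates className via its no-arg constructor using the given loader. */
    static Object newInstance(ClassLoader loader, String className) {
        try {
            return Class.forName(className, true, loader)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load " + className, e);
        }
    }
}
```

In the child process, something like newInstance(loaderFor(dir), parserClassName) would then be cast to Parser; which class name to pass, and how it would be wrapped, is exactly the part that would presumably come from tika-config.xml.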
