Hi,

On Fri, Sep 10, 2010 at 10:31 PM, Nick Burch <[email protected]> wrote:
> Quite a lot of OfficeParser does depend on poifs code though, as well as a
> few bits that depend on some of the less common POI text extractors.

It looks like a number of our other new parsers also have direct dependencies on external libraries, so this problem is not limited to the OfficeParser class.

The basic problem here is that the service loader used by the default TikaConfig constructor throws an exception when it can't load a class listed in an org.apache.tika.parser.Parser service file. The solution I implemented in TIKA-378 for the 0.7 release was to move the external parser library references to separate extractor classes so that the parser class could be instantiated without problems. Unfortunately this was a one-off solution that obviously hasn't survived further development in the svn trunk.

The reason why I originally didn't want to simply catch and ignore the potential exceptions in the TikaConfig constructor was the lack of a good error reporting mechanism. The trick of isolating the external library dependencies in separate extractor classes nicely solved that problem by delaying the exceptions to the actual parse() method calls on specific document types, which obviously would then give the end user a much better idea of what's wrong.

Perhaps the best solution would actually be to combine the above approaches, i.e. to strive to maintain the parser/extractor separation where possible and to use a catch block in the TikaConfig constructor to catch and ignore any problems that the isolation approach fails to address.

BR,

Jukka Zitting
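
For illustration, the catch-and-ignore half of the combined approach could look roughly like the sketch below. This is not the actual TikaConfig code and it does not use any Tika classes: LenientLoader, Result and org.example.MissingParser are hypothetical names, and the list of class names simply stands in for the entries of a service file. Each class is loaded on its own, and a class that is missing (or whose dependencies are missing) is recorded and skipped instead of aborting the whole load.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: loads classes by name and skips
// the ones that cannot be loaded instead of failing on the first problem.
class LenientLoader {

    /** Instances that could be created, plus the entries that were skipped. */
    static final class Result {
        final List<Object> loaded = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    static Result load(List<String> classNames) {
        Result result = new Result();
        ClassLoader loader = LenientLoader.class.getClassLoader();
        for (String name : classNames) {
            try {
                Class<?> type = Class.forName(name, true, loader);
                result.loaded.add(type.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                // ReflectiveOperationException covers ClassNotFoundException
                // (the class itself is missing) and instantiation failures.
                // LinkageError covers NoClassDefFoundError (a library the
                // class depends on is missing). Keep the reason so that it
                // can still be reported to the user later.
                result.skipped.add(name + " (" + e + ")");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the lines of a service file: one loadable class and
        // one class that is assumed not to be on the classpath.
        Result r = load(List.of("java.util.ArrayList", "org.example.MissingParser"));
        System.out.println("loaded:  " + r.loaded.size());
        System.out.println("skipped: " + r.skipped);
    }
}
```

Keeping the skipped entries together with their exceptions is one way to get around the missing error reporting mechanism: nothing is thrown at construction time, but the reason for each skipped class is still available if someone asks for it.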
