Re: Configuring parsers and translators

Nick Burch Sat, 06 Jun 2015 12:30:41 -0700

Anyone have any thoughts on this?

On Fri, 8 May 2015, Nick Burch wrote:

Hi All
This came up in TIKA-1623, but I thought it might be better brought out to the list for discussion.
To configure parsers on a per-document basis, such as setting PDF spacing tolerances, or telling Tesseract what language it should be OCRing for, we have the *Config objects. You create one of these, use the setters to configure it for your document, pop it onto the ParseContext, and it's used when processing your document.
To configure parsers and translators on a per-JVM basis, to apply to all documents processed, it's a bit less consistent. At least some look for a properties file with a specific name, usually in the tika namespace, and grab their settings / keys / etc out of that. At least some expect to find a *Config with their program path on it, even though that remains constant between documents. None of them support getting their settings from the Tika Config.
As part of our evolution of parser preferences, we're moving towards people either being able to set their preferences in code, or being able to supply a Tika Config xml which sets their parser preferences or overrides certain bits of the default. The code option works for people who want to declare certain specific things; the Tika Config one gives the same functionality but allows a consistent and clean way to set it between Tika App, Tika Server and Java code.
Another related example is the External Parser support. Because you can have multiple External Parser instances in your setup, one per format / program, we look for all the org/apache/tika/parser/external/tika-external-parsers.xml files on the classpath, and create parser instances based on definitions in there.
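To illustrate the "look for all the files on the classpath" step, here is a minimal sketch using only the standard ClassLoader API. The class and method names (ExternalParserConfigLocator, findAll) are made up for this example and are not Tika's own code; only the resource name comes from the message above.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

class ExternalParserConfigLocator {
    // Resource name as quoted in the message above.
    static final String RESOURCE =
        "org/apache/tika/parser/external/tika-external-parsers.xml";

    /** Returns every copy of the named resource visible to the given class loader. */
    static List<URL> findAll(ClassLoader loader, String resource) throws IOException {
        // getResources (plural) returns all matches, one per jar or directory
        // on the classpath that contains the resource, not just the first.
        return Collections.list(loader.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        for (URL url : findAll(loader, RESOURCE)) {
            // A real setup would parse each file and build one parser per definition.
            System.out.println("Found external parser definitions at " + url);
        }
    }
}
```

Because every matching file is returned, each jar can contribute its own definitions without any central registry; the list is simply empty if no such file is on the classpath.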
What do we think about setting executable paths and keys/logins for parsers like OCR, Strings, Translators etc? Always on ParseContext? Properties? Custom xml config? Tika config xml? Other? Combination?
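As one way to picture the "Combination" option, here is a rough sketch of a layered lookup: per-document values win, then JVM-wide config, then a legacy properties file. Everything in it is hypothetical: the LayeredSettings class, the plain maps standing in for ParseContext and Tika Config values, and the key names are assumptions for illustration, not an existing Tika API or an agreed design.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

class LayeredSettings {
    private final Map<String, String> perDocument; // stand-in for values set via ParseContext
    private final Map<String, String> jvmWide;     // stand-in for values from a Tika Config xml
    private final Properties legacyProperties;     // stand-in for a per-parser properties file

    LayeredSettings(Map<String, String> perDocument,
                    Map<String, String> jvmWide,
                    Properties legacyProperties) {
        this.perDocument = perDocument;
        this.jvmWide = jvmWide;
        this.legacyProperties = legacyProperties;
    }

    /** Returns the first value found, checking the most specific layer first. */
    Optional<String> resolve(String key) {
        String value = perDocument.get(key);
        if (value == null) {
            value = jvmWide.get(key);
        }
        if (value == null) {
            value = legacyProperties.getProperty(key);
        }
        return Optional.ofNullable(value);
    }

    public static void main(String[] args) {
        Map<String, String> perDocument = new HashMap<>();
        perDocument.put("ocr.language", "fra");               // hypothetical key

        Map<String, String> jvmWide = new HashMap<>();
        jvmWide.put("ocr.language", "eng");
        jvmWide.put("ocr.path", "/usr/bin");                  // hypothetical key and path

        Properties legacy = new Properties();
        legacy.setProperty("translator.api-key", "example");  // hypothetical key

        LayeredSettings settings = new LayeredSettings(perDocument, jvmWide, legacy);
        System.out.println(settings.resolve("ocr.language"));       // per-document value wins
        System.out.println(settings.resolve("ocr.path"));           // falls back to JVM-wide
        System.out.println(settings.resolve("translator.api-key")); // falls back to properties
        System.out.println(settings.resolve("unknown"));            // nothing configured
    }
}
```

With this ordering, things that stay constant between documents (executable paths, keys, logins) can live in the JVM-wide layer, while the per-document layer stays reserved for settings that really do vary per document.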
Nick
