Anyone have any thoughts on this?
On Fri, 8 May 2015, Nick Burch wrote:
Hi All
This came up in TIKA-1623, but I thought it might be better brought out to
the list for discussion.
To configure parsers on a per-document basis, such as setting PDF
spacing tolerances, or telling Tesseract which language it should be
OCRing, we have the *Config objects. You create one of these, use
the setters to configure it for your document, pop it onto the
ParseContext, and it's used when processing your document.
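As a rough illustration of the per-document pattern described above, here is a minimal sketch using only the JDK. `SketchContext` and `OcrSettings` are hypothetical stand-ins invented for this example, not Tika's own classes; the only thing borrowed is the idea of a class-keyed holder that a parser consults at parse time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Tika classes.
class PerDocumentConfigSketch {

    // Per-document settings object with plain setters, in the spirit of a *Config.
    static class OcrSettings {
        private String language = "eng"; // assumed default, for the sketch only

        void setLanguage(String language) { this.language = language; }
        String getLanguage() { return language; }
    }

    // Class-keyed holder: one entry per config type, looked up by the parser.
    static class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) { entries.put(key, value); }

        // Returns the stored value, or the supplied default if none was set.
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    // What a parser might do: use the caller's settings, else fall back to defaults.
    static String languageFor(SketchContext context) {
        return context.get(OcrSettings.class, new OcrSettings()).getLanguage();
    }

    public static void main(String[] args) {
        OcrSettings settings = new OcrSettings();
        settings.setLanguage("fra");

        SketchContext context = new SketchContext();
        context.set(OcrSettings.class, settings);

        System.out.println(languageFor(context));             // configured document
        System.out.println(languageFor(new SketchContext())); // nothing set, defaults apply
    }
}
```

The point of the sketch is that the settings travel with the individual parse call, which is why this route feels awkward for values such as a program path that never change between documents.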
To configure parsers and translators on a per-JVM basis, to apply to all
documents processed, it's a bit less consistent. At least some look for
a properties file with a specific name, usually in the tika namespace,
and grab their settings / keys / etc out of that. At least some expect
to find a *Config with their program path on it, even though that
remains constant between documents. None of them support getting their
settings from the Tika Config.
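For comparison, the properties-file route for per-JVM settings might look roughly like the sketch below. It uses only the JDK; the resource name `org/apache/tika/example/sketch-tool.properties` and the key `tool.path` are made up for this example and do not correspond to any file or key that Tika actually reads.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: the resource name and property key below are hypothetical.
class JvmWidePropertiesSketch {

    static final String RESOURCE = "org/apache/tika/example/sketch-tool.properties";
    static final String PATH_KEY = "tool.path";

    // Loads a properties resource from the classpath; returns an empty
    // Properties object if the resource is not present.
    static Properties load(String resourceName) {
        Properties props = new Properties();
        ClassLoader loader = JvmWidePropertiesSketch.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Reads the program path, falling back to a caller-supplied default.
    static String toolPath(Properties props, String fallback) {
        return props.getProperty(PATH_KEY, fallback);
    }

    public static void main(String[] args) {
        Properties props = load(RESOURCE);
        System.out.println("Tool path: " + toolPath(props, "/usr/bin/tool"));
    }
}
```

Each component doing its own lookup like this is what leads to the inconsistency: every parser or translator ends up with its own file name, its own keys, and its own fallback rules.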
As part of our evolution of parser preferences, we're moving towards
people either being able to set their preferences in code, or being able
to supply a Tika Config XML which sets their parser preferences or
overrides certain bits of the default. The code option works for people
who want to declare certain specific things; the Tika Config one gives
the same functionality but allows a consistent and clean way to set it
across Tika App, Tika Server and Java code.
Another related example is the External Parser support. Because you can
have multiple External Parser instances in your setup, one per format /
program, we look for all the
org/apache/tika/parser/external/tika-external-parsers.xml files on the
classpath, and create parser instances based on the definitions in there.
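The "look for all the files on the classpath" step can be sketched with the JDK's `ClassLoader.getResources`, which returns every match rather than just the first. This is not Tika's actual implementation, only an illustration of the lookup; the class and method names are invented for the example, and parsing the XML definitions is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Illustrative only: shows the classpath lookup, not how Tika builds its parsers.
class ClasspathScanSketch {

    static final String RESOURCE =
            "org/apache/tika/parser/external/tika-external-parsers.xml";

    // Returns the URL of every classpath resource with the given name.
    // Several jars can each contribute one, so the result may hold many entries.
    static List<URL> findAll(String resourceName) {
        ClassLoader loader = ClasspathScanSketch.class.getClassLoader();
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> definitions = findAll(RESOURCE);
        if (definitions.isEmpty()) {
            System.out.println("No external parser definitions on the classpath");
        }
        for (URL url : definitions) {
            // A real implementation would parse each file and create parser instances.
            System.out.println("Would read parser definitions from: " + url);
        }
    }
}
```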
What do we think about setting executable paths and keys/logins for
parsers like OCR, Strings, Translators etc? Always on ParseContext?
Properties? Custom XML config? Tika Config XML? Other? Combination?
Nick