Hi Nick, I've been mulling this over since you sent the first message, but I'm afraid I don't have a good solution or any well-developed ideas.
I agree, it would be very nice to consolidate all configuration for all parsers in the server and app. Is it feasible to put everything into tika-config? Then Parser implementations would read the config to pull out their own configuration. Or, would it be better to keep some configuration separate?

Documentation would be an issue if every parser defines its own metadata keys... But, it might be an improvement since we don't have "free form" properties and configuration files.

Tyler

On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <[email protected]> wrote:

> Anyone have any thoughts on this?
>
> On Fri, 8 May 2015, Nick Burch wrote:
> > Hi All
> >
> > This came up in TIKA-1623, but I thought it might be better brought out to
> > the list for discussion
> >
> > To configure parsers on a per-document basis, such as setting PDF
> > spacing tolerances, or telling Tesseract what language it should be
> > OCRing for, we have the *Config objects. You create one of these, use
> > the setters to configure it for your document, pop it onto the Parse
> > context and it's used when processing your document
> >
> > To configure parsers and translators on a per-JVM basis, to apply to all
> > documents processed, it's a bit less consistent. At least some look for
> > a properties file with a specific name, usually in the tika namespace,
> > and grab their settings / keys / etc out of that. At least some expect
> > to find a *Config with their program path on it, even though that
> > remains constant between documents. None of them support getting their
> > settings from the Tika Config
> >
> >
> > As part of our evolution of parser preferences, we're moving towards
> > people either being able to set their preferences in code, or being able
> > to supply a Tika Config xml which sets their parser preferences or
> > overrides certain bits of the default. The code option works for people
> > who want to declare certain specific things, the Tika Config one gives
> > the same functionality but allows a consistent and clean way to set it
> > between Tika App, Tika Server and java code.
> >
> > Another related example is the External Parser support. Because you can
> > have multiple External Parser instances in your setup, one per format /
> > program, we look for all the
> > org/apache/tika/parser/external/tika-external-parsers.xml files on the
> > classpath, and create parser instances based on definitions in there
> >
> >
> > What do we think about setting executable paths and keys/logins for
> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
> > Properties? Custom xml config? Tika config xml? Other? Combination?
> >
> > Nick
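
For anyone following the thread, here is a rough sketch of one possible combination of the options above: JVM-wide defaults (as if loaded once from a central config) with an optional per-document override placed on a context object. It uses only the JDK. `Context`, `OcrConfig`, `resolve` and the path `/usr/bin/tesseract` are hypothetical stand-ins invented for illustration, not Tika's actual classes or settings, and this is not a proposal for how Tika should implement it.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextSketch {

    /** Hypothetical stand-in for a per-document context: a map keyed by config class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        /** Returns the entry stored under the key, or the default if none was set. */
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Hypothetical OCR settings; the field names are illustrative only. */
    static final class OcrConfig {
        final String executablePath; // constant for the JVM
        final String language;       // may vary per document

        OcrConfig(String executablePath, String language) {
            this.executablePath = executablePath;
            this.language = language;
        }

        /** Copies the JVM-wide settings, changing only the per-document language. */
        OcrConfig withLanguage(String newLanguage) {
            return new OcrConfig(executablePath, newLanguage);
        }
    }

    /** Picks the per-document config if one is on the context, else the JVM-wide defaults. */
    static OcrConfig resolve(Context context, OcrConfig jvmWideDefaults) {
        return context.get(OcrConfig.class, jvmWideDefaults);
    }

    public static void main(String[] args) {
        // Assumed defaults, as if read once from a central config file.
        OcrConfig defaults = new OcrConfig("/usr/bin/tesseract", "eng");

        Context plainDocument = new Context();
        Context frenchDocument = new Context();
        frenchDocument.set(OcrConfig.class, defaults.withLanguage("fra"));

        System.out.println(resolve(plainDocument, defaults).language);
        System.out.println(resolve(frenchDocument, defaults).language);
        System.out.println(resolve(frenchDocument, defaults).executablePath);
    }
}
```

In this sketch the executable path is set once and carried into every per-document override, so callers only touch the settings that actually vary between documents.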
