I think it would be great to have all this in the Tika Config. The one requirement is that we provide an example default config and make it *hugely* clear, rather than keeping all the levels of indirection we currently have (SPI, swallowed print messages, etc.), which make a config error super hard to track down.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Tyler Palsulich <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Saturday, June 6, 2015 at 3:45 PM
To: "[email protected]" <[email protected]>
Subject: Re: Configuring parsers and translators

>Hi Nick,
>
>I've been mulling this over since you sent the first message. But, I'm
>afraid I don't have a good solution or developed ideas.
>
>I agree, it would be very nice to consolidate all configuration for all
>parsers in the server and app.
>
>Is it feasible to put everything into tika-config? Then Parser
>implementations would read the config to pull out their own configuration.
>Or, would it be better to keep some configuration separate? Documentation
>would be an issue if every parser defines its own metadata keys... But, it
>might be an improvement since we don't have "free form" properties and
>configuration files.
>
>Tyler
>
>On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <[email protected]> wrote:
>
>> Anyone have any thoughts on this?
>>
>> On Fri, 8 May 2015, Nick Burch wrote:
>> > Hi All
>> >
>> > This came up in TIKA-1623, but I thought it might be better brought
>> > out to the list for discussion
>> >
>> > To configure parsers on a per-document basis, such as setting PDF
>> > spacing tolerances, or telling Tesseract what language it should be
>> > OCRing for, we have the *Config objects. You create one of these, use
>> > the setters to configure it for your document, pop it onto the Parse
>> > context and it's used when processing your document
>> >
>> > To configure parsers and translators on a per-JVM basis, to apply to
>> > all documents processed, it's a bit less consistent. At least some
>> > look for a properties file with a specific name, usually in the tika
>> > namespace, and grab their settings / keys / etc out of that. At least
>> > some expect to find a *Config with their program path on it, even
>> > though that remains constant between documents. None of them support
>> > getting their settings from the Tika Config
>> >
>> >
>> > As part of our evolution of parser preferences, we're moving towards
>> > people either being able to set their preferences in code, or being
>> > able to supply a Tika Config xml which sets their parser preferences
>> > or overrides certain bits of the default. The code option works for
>> > people who want to declare certain specific things, the Tika Config
>> > one gives the same functionality but allows a consistent and clean
>> > way to set it between Tika App, Tika Server and java code.
>> >
>> > Another related example is the External Parser support. Because you
>> > can have multiple External Parser instances in your setup, one per
>> > format / program, we look for all the
>> > org/apache/tika/parser/external/tika-external-parsers.xml files on the
>> > classpath, and create parser instances based on definitions in there
>> >
>> >
>> > What do we think about setting executable paths and keys/logins for
>> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
>> > Properties? Custom xml config? Tika config xml? Other? Combination?
>> >
>> > Nick
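
For readers following the thread, here is a minimal sketch of the two levels of configuration Nick describes: a *Config object placed on a per-document context, with a fallback to settings that stay constant for the whole JVM. It does not use Tika's classes. `Context`, `OcrConfig` and `OcrParser` are hypothetical stand-ins written for this illustration, and the language codes are example values only. How the JVM-wide defaults would actually be loaded (properties file, Tika Config xml, or something else) is exactly the open question in the thread, so the sketch just passes them to a constructor.

```java
import java.util.HashMap;
import java.util.Map;

public class PerDocumentConfigSketch {

    /** Hypothetical stand-in for a per-document context keyed by config type. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Hypothetical stand-in for a *Config object (here: an OCR language setting). */
    static final class OcrConfig {
        private final String language;

        OcrConfig(String language) {
            this.language = language;
        }

        String getLanguage() {
            return language;
        }
    }

    /** Hypothetical parser that prefers per-document settings over JVM-wide ones. */
    static final class OcrParser {
        private final OcrConfig jvmDefaults;

        OcrParser(OcrConfig jvmDefaults) {
            this.jvmDefaults = jvmDefaults;
        }

        String effectiveLanguage(Context context) {
            // Use the config on the context if one was set for this document,
            // otherwise fall back to the defaults shared by all documents.
            return context.get(OcrConfig.class, jvmDefaults).getLanguage();
        }
    }

    public static void main(String[] args) {
        OcrParser parser = new OcrParser(new OcrConfig("eng"));

        Context frenchDocument = new Context();
        frenchDocument.set(OcrConfig.class, new OcrConfig("fra"));

        System.out.println(parser.effectiveLanguage(frenchDocument)); // prints "fra"
        System.out.println(parser.effectiveLanguage(new Context()));  // prints "eng"
    }
}
```

Keeping the fallback inside the parser means callers only touch the context when a document really needs different settings; anything constant for the JVM, such as an executable path, would live with the defaults.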
