(Devil's advocate hat slightly on.) My one hesitation about putting it all into tika-config is that the default might get to be a monstrosity -- difficult for new users to use.
Tyler

On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <[email protected]> wrote:

> I think it would be great to have all this in the Tika Config.
>
> The one thing then is to provide an example default config and
> to make it *hugely* clear rather than all the levels of indirection
> that we currently have going on which makes it super hard when
> there is a config error (SPI, swallowing print messages, etc.)
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> -----Original Message-----
> From: Tyler Palsulich <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Saturday, June 6, 2015 at 3:45 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Configuring parsers and translators
>
> >Hi Nick,
> >
> >I've been mulling this over since you sent the first message. But, I'm
> >afraid I don't have a good solution or developed ideas.
> >
> >I agree, it would be very nice to consolidate all configuration for all
> >parsers in the server and app.
> >
> >Is it feasible to put everything into tika-config? Then Parser
> >implementations would read the config to pull out their own configuration.
> >Or, would it be better to keep some configuration separate? Documentation
> >would be an issue if every parser defines its own metadata keys... But, it
> >might be an improvement since we don't have "free form" properties and
> >configuration files.
> >
> >Tyler
> >
> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <[email protected]> wrote:
> >
> >> Anyone have any thoughts on this?
> >>
> >> On Fri, 8 May 2015, Nick Burch wrote:
> >> > Hi All
> >> >
> >> > This came up in TIKA-1623, but I thought it might be better brought
> >> > out to the list for discussion
> >> >
> >> > To configure parsers on a per-document basis, such as setting PDF
> >> > spacing tolerances, or telling Tesseract what language it should be
> >> > OCRing for, we have the *Config objects. You create one of these, use
> >> > the setters to configure it for your document, pop it onto the Parse
> >> > context and it's used when processing your document
> >> >
> >> > To configure parsers and translators on a per-JVM basis, to apply to
> >> > all documents processed, it's a bit less consistent. At least some look
> >> > for a properties file with a specific name, usually in the tika namespace,
> >> > and grab their settings / keys / etc out of that. At least some expect
> >> > to find a *Config with their program path on it, even though that
> >> > remains constant between documents. None of them support getting their
> >> > settings from the Tika Config
> >> >
> >> >
> >> > As part of our evolution of parser preferences, we're moving towards
> >> > people either being able to set their preferences in code, or being
> >> > able to supply a Tika Config xml which sets their parser preferences or
> >> > overrides certain bits of the default. The code option works for
> >> > people who want to declare certain specific things, the Tika Config one
> >> > gives the same functionality but allows a consistent and clean way to
> >> > set it between Tika App, Tika Server and java code.
> >> >
> >> > Another related example is the External Parser support. Because you
> >> > can have multiple External Parser instances in your setup, one per
> >> > format / program, we look for all the
> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on the
> >> > classpath, and create parser instances based on definitions in there
> >> >
> >> >
> >> > What do we think about setting executable paths and keys/logins for
> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
> >> >
> >> > Nick
> >> >
> >>
>
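For readers following the thread, here is a rough sketch of the layering being discussed: a per-JVM default (as might be read once from a central config) that a per-document *Config object placed on the parse context can override. Everything in it is a hypothetical stand-in written for illustration: `OcrConfig`, `Context`, `resolve` and the example path are assumptions, not Tika classes or Tika's actual lookup logic.

```java
import java.util.HashMap;
import java.util.Map;

class ConfigLayeringSketch {

    /** Hypothetical stand-in for a parser's *Config object (not a real Tika class). */
    static final class OcrConfig {
        private String language = "eng";
        private String programPath = "";

        String getLanguage() { return language; }
        void setLanguage(String language) { this.language = language; }
        String getProgramPath() { return programPath; }
        void setProgramPath(String programPath) { this.programPath = programPath; }
    }

    /** Hypothetical class-keyed context, standing in for the parse context. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) { entries.put(key, value); }

        /** Returns the entry stored for key, or defaultValue if none was set. */
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Picks the per-document config if one is on the context, else the per-JVM default. */
    static OcrConfig resolve(Context context, OcrConfig jvmDefault) {
        return context.get(OcrConfig.class, jvmDefault);
    }

    public static void main(String[] args) {
        // Per-JVM default, as might be loaded once from a central config file.
        OcrConfig jvmDefault = new OcrConfig();
        jvmDefault.setProgramPath("/usr/bin/tesseract"); // illustrative path only

        // Document 1: nothing on the context, so the JVM-wide default applies.
        Context plain = new Context();
        System.out.println(resolve(plain, jvmDefault).getLanguage());

        // Document 2: a per-document config placed on the context wins.
        OcrConfig french = new OcrConfig();
        french.setLanguage("fra");
        french.setProgramPath(jvmDefault.getProgramPath());
        Context overridden = new Context();
        overridden.set(OcrConfig.class, french);
        System.out.println(resolve(overridden, jvmDefault).getLanguage());
    }
}
```

Running `main` prints `eng` for the first document and `fra` for the second. Note that in this sketch the per-document config still has to carry the program path, which is the kind of constant-between-documents setting the thread suggests moving into a single central config.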
