I think it would be great to have all this in the Tika Config.

The one thing then is to provide an example default config and
to make it *hugely* clear, rather than keeping all the levels of
indirection that we currently have, which make it super hard to
debug a config error (SPI, swallowed print messages, etc.).
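
To make the "hugely clear" point concrete, here is a rough sketch of what loud, single-place config loading could look like. It is only an illustration, not how Tika reads its config today: the <param> layout, the parameter names and values, and the ConfigSketch / ConfigException names are all assumptions made up for this example. It uses only the JDK's DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: not part of Tika.
final class ConfigSketch {

    /** Raised for every config problem, so nothing is swallowed silently. */
    static class ConfigException extends RuntimeException {
        ConfigException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Reads parser settings from one XML document.
     * Assumed layout: <parser class="..."><param name="...">value</param></parser>
     * Returns parser class name -> (param name -> value).
     */
    static Map<String, Map<String, String>> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Map<String, String>> result = new LinkedHashMap<>();
            NodeList parsers = doc.getElementsByTagName("parser");
            for (int i = 0; i < parsers.getLength(); i++) {
                Element parser = (Element) parsers.item(i);
                String cls = parser.getAttribute("class");
                if (cls.isEmpty()) {
                    // Say exactly which element is wrong instead of skipping it.
                    throw new ConfigException(
                            "<parser> element #" + (i + 1) + " has no class attribute", null);
                }
                Map<String, String> params = new LinkedHashMap<>();
                NodeList paramNodes = parser.getElementsByTagName("param");
                for (int j = 0; j < paramNodes.getLength(); j++) {
                    Element param = (Element) paramNodes.item(j);
                    String name = param.getAttribute("name");
                    if (name.isEmpty()) {
                        throw new ConfigException(
                                "<param> without a name in parser " + cls, null);
                    }
                    params.put(name, param.getTextContent().trim());
                }
                result.put(cls, params);
            }
            return result;
        } catch (ConfigException e) {
            throw e;
        } catch (Exception e) {
            // Malformed XML and I/O problems surface with their cause attached.
            throw new ConfigException("Could not read config: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Example default config; the param names and values are illustrative only.
        String xml = "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.ocr.TesseractOCRParser\">"
                + "<param name=\"tesseractPath\">/usr/bin</param>"
                + "<param name=\"language\">eng</param>"
                + "</parser>"
                + "</parsers></properties>";
        System.out.println(load(xml));
    }
}
```

The idea is simply that each parser would pull its own settings out of the one map, and that any mistake in the file fails immediately with a message naming the offending element.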


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Tyler Palsulich <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Saturday, June 6, 2015 at 3:45 PM
To: "[email protected]" <[email protected]>
Subject: Re: Configuring parsers and translators

>Hi Nick,
>
>I've been mulling this over since you sent the first message. But, I'm
>afraid I don't have a good solution or developed ideas.
>
>I agree, it would be very nice to consolidate all configuration for all
>parsers in the server and app.
>
>Is it feasible to put everything into tika-config? Then Parser
>implementations would read the config to pull out their own configuration.
>Or, would it be better to keep some configuration separate? Documentation
>would be an issue if every parser defines its own metadata keys... But, it
>might be an improvement since we don't have "free form" properties and
>configuration files.
>
>Tyler
>
>On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <[email protected]> wrote:
>
>> Anyone have any thoughts on this?
>>
>> On Fri, 8 May 2015, Nick Burch wrote:
>> > Hi All
>> >
>> > This came up in TIKA-1623, but I thought it might be better brought
>> > out to the list for discussion
>> >
>> > To configure parsers on a per-document basis, such as setting PDF
>> > spacing tolerances, or telling Tesseract what language it should be
>> > OCRing for, we have the *Config objects. You create one of these, use
>> > the setters to configure it for your document, pop it onto the Parse
>> > context and it's used when processing your document
>> >
>> > To configure parsers and translators on a per-JVM basis, to apply to
>> > all documents processed, it's a bit less consistent. At least some look
>> > for a properties file with a specific name, usually in the tika namespace,
>> > and grab their settings / keys / etc out of that. At least some expect
>> > to find a *Config with their program path on it, even though that
>> > remains constant between documents. None of them support getting their
>> > settings from the Tika Config
>> >
>> >
>> > As part of our evolution of parser preferences, we're moving towards
>> > people either being able to set their preferences in code, or being
>> > able to supply a Tika Config xml which sets their parser preferences or
>> > overrides certain bits of the default. The code option works for
>> > people who want to declare certain specific things, the Tika Config one
>> > gives the same functionality but allows a consistent and clean way to
>> > set it between Tika App, Tika Server and java code.
>> >
>> > Another related example is the External Parser support. Because you
>> > can have multiple External Parser instances in your setup, one per
>> > format / program, we look for all the
>> > org/apache/tika/parser/external/tika-external-parsers.xml files on the
>> > classpath, and create parser instances based on definitions in there
>> >
>> >
>> > What do we think about setting executable paths and keys/logins for
>> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
>> > Properties? Custom xml config? Tika config xml? Other? Combination?
>> >
>> > Nick
>> >
>>
