Configuring specific parsers that don't have their own parser config
objects is a pain.  For example, we currently have an option to set
PDFParserConfig and TesseractOCRConfig options via headers to
tika-server, and we have a way to extend this functionality to other
parsers.  This option is "not pretty"(TM), but it has two benefits: it
correctly differentiates creation-time settings (which apply to all
files) from runtime settings (which apply to a specific file), and it
reuses a single static parser, so there's no overhead in rebuilding
the parser object for every file.
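
To make the header approach concrete, here is a minimal client-side
sketch in Python.  It only builds the request and does not send it.
The URL assumes a tika-server on its default port, and the two
X-Tika-* header names are the ones I believe the server accepts for
PDF and OCR settings; please check them against the tika-server docs
for your version.  build_tika_request is a made-up helper name.

```python
import urllib.request

# Assumption: a local tika-server listening on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, per_file_headers: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying per-file settings.

    The headers are runtime settings: they affect only this request,
    while the server keeps reusing its single, already-built parser.
    """
    req = urllib.request.Request(TIKA_URL, data=data, method="PUT")
    req.add_header("Accept", "text/plain")
    for name, value in per_file_headers.items():
        req.add_header(name, value)
    return req


# Header names below are assumptions to verify against your server version.
req = build_tika_request(
    b"%PDF-1.4 placeholder bytes",
    {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-OCRLanguage": "eng",
    },
)

# urllib normalizes header capitalization; HTTP header names are
# case-insensitive, so compare in lowercase.
sent_headers = {k.lower(): v for k, v in req.header_items()}
```

Sending it would just be urllib.request.urlopen(req) against a running
server.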

So, we could add an NER parser config along the lines of the
PDFParserConfig, or...

...I regret I can't tell if this is what you're proposing, but we
could specify a tika-config.xml file via URL parameters?  This would
add the overhead of loading the full parser for every request that
specifies its own custom config.  Or, I guess, we could load some
number of preconfigured parsers at startup and name them?

On Tue, Jan 31, 2023 at 5:34 AM Cedric Ulmer
<[email protected]> wrote:
>
> Hi all,
>
> We are playing with the regex-based detection capabilities of Tika combined 
> with ManifoldCF, and an idea came to our mind. First, the problem: for now, a 
> tika server has only one configuration. Therefore, if we set a regex based 
> entity extraction, it will be applied to all of the documents (for given mime 
> types). So if in ManifoldCF we call the Tika server during a crawling phase, 
> we cannot have different regex rules per crawling job: any job that calls the 
> tika server will be processed the same way.
>
> So here is the idea: wouldn't it be possible to make the call to a tika 
> server configurable via a REST parameter/arguments, where we could set which 
> config we want to use for the current call? Something like: 
> ?enableNER=true&NERConfig=regex1
>
> Regards,
>
> Cédric
> CEO
> France Labs - Your knowledge, now
> Datafari Enterprise Search
>
