Here's a first attempt at documentation:
https://cwiki.apache.org/confluence/display/TIKA/Configuring+Parsers+At+Parse+Time+in+tika-server

Please let me know if you have any questions or want write access to
improve the documentation!
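
For readers who want to try this before reading the wiki page, here is a minimal sketch of how per-request parser options can be turned into tika-server headers. It assumes the `X-Tika-PDF` and `X-Tika-OCR` header prefixes and a server listening on localhost:9998; the helper names (`build_tika_headers`, `parse_with_headers`) are hypothetical and not part of Tika. Please check the option names against the wiki page and your tika-server version.

```python
import http.client


def build_tika_headers(pdf_options=None, ocr_options=None):
    """Map per-request parser options to tika-server header names.

    Assumption: the server reads PDFParserConfig options from headers
    prefixed with "X-Tika-PDF" and Tesseract options from "X-Tika-OCR".
    """
    headers = {}
    for prefix, options in (("X-Tika-PDF", pdf_options), ("X-Tika-OCR", ocr_options)):
        for name, value in (options or {}).items():
            # Send booleans in lowercase ("true"/"false"), not Python's "True".
            text = str(value).lower() if isinstance(value, bool) else str(value)
            headers[prefix + name] = text
    return headers


def parse_with_headers(data, headers, host="localhost", port=9998):
    """PUT a document to /tika with the given headers (host/port are assumptions)."""
    conn = http.client.HTTPConnection(host, port)
    try:
        request_headers = dict(headers, Accept="text/plain")
        conn.request("PUT", "/tika", body=data, headers=request_headers)
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()


# Example: OCR-only PDF handling with French as the OCR language, for one file.
example_headers = build_tika_headers(
    pdf_options={"OcrStrategy": "ocr_only"},
    ocr_options={"Language": "fra"},
)
```

The headers only affect the request they are sent with, so two crawl jobs can use different settings against the same running server.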

On Wed, Feb 15, 2023 at 11:07 AM Julien Massiera
<[email protected]> wrote:
>
> Hi Tim,
>
> Following up on our mail thread: could you share more documentation on how 
> to use the headers to configure the PDFParser on the fly?
>
> Thanks,
> Julien
>
> -----Original Message-----
> From: Julien Massiera <[email protected]>
> Sent: Friday, February 3, 2023 13:08
> To: [email protected]
> Subject: RE: Adding arguments to configure tika from the rest calls
>
> Hi Tim,
>
> Configuring the NER parser via headers, like the PDFParserConfig, sounds like 
> an interesting approach. I only discovered that feature thanks to your reply, 
> and when I looked for documentation about it, the only thing I found was a 
> TBD note on this page: 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
>
> Could you tell us more about how to use it, so that we can test it and get a 
> better idea of how it works and how useful it would be for NER?
>
> Thanks,
> Julien
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: Tuesday, January 31, 2023 13:19
> To: [email protected]
> Subject: Re: Adding arguments to configure tika from the rest calls
>
> Configuring specific parsers that don't have their own parser config objects 
> is a pain.  For example, we currently have an option to set PDFParserConfig 
> and TesseractParserConfig options via headers to tika-server...and we have a 
> way to extend this functionality to other parsers.  This option is "not 
> pretty"(TM), but it has the benefit of correctly differentiating 
> creation-time settings (which apply to all files) from runtime settings 
> (which apply to a specific file), and this process 
> reuses a single static parser so there's no overhead in rebuilding the parser 
> object for every file.
>
> So, we could add an NER parse config along the lines of the PDFParserConfig, 
> or...
>
> ...I regret I can't tell if this is what you're proposing, but we could 
> specify a tika-config.xml file via URL parameters?  This would add the 
> overhead of loading the full parser for each parse where you specify your own 
> custom parser.  Or, I guess, we could load some number of default parsers and 
> name them?
>
> On Tue, Jan 31, 2023 at 5:34 AM Cedric Ulmer <[email protected]> 
> wrote:
> >
> > Hi all,
> >
> > We are playing with the regex-based detection capabilities of Tika combined 
> > with ManifoldCF, and an idea came to mind. First, the problem: for now, 
> > a Tika server has only one configuration. Therefore, if we set up 
> > regex-based entity extraction, it will be applied to all documents (for 
> > the given mime types). So if ManifoldCF calls the Tika server during a 
> > crawling phase, we cannot have different regex rules per crawling job: any 
> > job that calls the Tika server will be processed the same way.
> >
> > So here is the idea: wouldn't it be possible to make the call to a
> > tika server configurable via a REST parameter/arguments, where we
> > could set which config we want to use for the current call? Something
> > like: ?enableNER=true&NERConfig=regex1
> >
> > Regards,
> >
> > Cédric
> > CEO
> > France Labs - Your knowledge, now
> > Datafari Enterprise Search
> >
>
>
