Here's a first attempt at documentation: https://cwiki.apache.org/confluence/display/TIKA/Configuring+Parsers+At+Parse+Time+in+tika-server
Please let me know if you have any questions or want write access to improve the documentation!

On Wed, Feb 15, 2023 at 11:07 AM Julien Massiera <[email protected]> wrote:
>
> Hi Tim,
>
> Bouncing back on our mail thread, could you share more documentation on
> how to use the header to configure the PDFParser on the fly?
>
> Thanks,
> Julien
>
> -----Original Message-----
> From: Julien Massiera <[email protected]>
> Sent: Friday, February 3, 2023 13:08
> To: [email protected]
> Subject: RE: Adding arguments to configure tika from the rest calls
>
> Hi Tim,
>
> Configuring the NER parser via headers, like the PDFParserConfig, sounds
> like an interesting approach, but I only discovered that feature thanks
> to your reply. I tried to find documentation about it; unfortunately,
> the only thing I found was a TBD note on this page:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
>
> Could you tell us more about how to use it, so that we can test it and
> get a better idea of how it works and how useful it would be for NER?
>
> Thanks,
> Julien
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: Tuesday, January 31, 2023 13:19
> To: [email protected]
> Subject: Re: Adding arguments to configure tika from the rest calls
>
> Configuring specific parsers that don't have their own parser config
> objects is a pain. For example, we currently have an option to set
> PDFParserConfig and TesseractParserConfig options via headers to
> tika-server...and we have a way to extend this functionality to other
> parsers. This option is "not pretty"(TM), but it has the benefit of
> correctly differentiating creation-time settings (which apply to all
> files) from runtime settings (which apply to a specific file), and this
> process reuses a single static parser, so there's no overhead in
> rebuilding the parser object for every file.
>
> So, we could add an NER parse config along the lines of the
> PDFParserConfig, or...
>
> ...I regret I can't tell if this is what you're proposing, but we could
> specify a tika-config.xml file via URL parameters? This would add the
> overhead of loading the full parser for each parse where you specify
> your own custom parser. Or, I guess, we could load x many default
> parsers and name them?
>
> On Tue, Jan 31, 2023 at 5:34 AM Cedric Ulmer <[email protected]>
> wrote:
> >
> > Hi all,
> >
> > We are playing with the regex-based detection capabilities of Tika
> > combined with ManifoldCF, and an idea came to mind. First, the
> > problem: for now, a tika server has only one configuration.
> > Therefore, if we set up regex-based entity extraction, it will be
> > applied to all of the documents (for the given mime types). So if in
> > ManifoldCF we call the Tika server during a crawling phase, we cannot
> > have different regex rules per crawling job: any job that calls the
> > tika server will be processed the same way.
> >
> > So here is the idea: wouldn't it be possible to make the call to a
> > tika server configurable via REST parameters/arguments, where we
> > could set which config we want to use for the current call? Something
> > like: ?enableNER=true&NERConfig=regex1
> >
> > Regards,
> >
> > Cédric
> > CEO
> > France Labs - Your knowledge, now
> > Datafari Enterprise Search
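
For anyone following this thread, here is a rough sketch of what the header-based, per-request configuration discussed above might look like from a client. It is only an illustration under assumptions: the server URL and port, the helper names `pdf_config_headers` and `build_parse_request`, and the two options in the usage note are not taken from this thread. The `X-Tika-PDF` prefix followed by a PDFParserConfig field name reflects my understanding of the tika-server convention; please check the wiki page linked above for the options your Tika version actually supports.

```python
import urllib.request

# Assumption: a tika-server instance listening locally on its usual port.
TIKA_URL = "http://localhost:9998/tika"

# Assumption: PDFParserConfig fields are set per request through headers
# named "X-Tika-PDF" + <field name>.
PDF_HEADER_PREFIX = "X-Tika-PDF"


def pdf_config_headers(options):
    """Turn {field name: value} into per-request PDF parser headers."""
    headers = {}
    for name, value in options.items():
        if isinstance(value, bool):
            # Header values are strings; booleans become "true"/"false".
            value = "true" if value else "false"
        headers[PDF_HEADER_PREFIX + name] = str(value)
    return headers


def build_parse_request(pdf_bytes, options, url=TIKA_URL):
    """Build (but do not send) a PUT request that parses one PDF
    with settings that apply to this file only."""
    headers = {"Accept": "text/plain", "Content-Type": "application/pdf"}
    headers.update(pdf_config_headers(options))
    return urllib.request.Request(
        url, data=pdf_bytes, headers=headers, method="PUT"
    )
```

With a server running, something like `urllib.request.urlopen(build_parse_request(data, {"extractInlineImages": True, "OcrStrategy": "ocr_only"}))` would send the file with those settings for that single request, leaving the server-wide configuration untouched. This is the runtime-versus-creation-time distinction described earlier in the thread.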
