Hi all,

We are experimenting with the regex-based detection capabilities of Tika combined 
with ManifoldCF, and an idea came to mind. First, the problem: for now, a 
Tika server has only one configuration. Therefore, if we set up regex-based 
entity extraction, it is applied to all documents (for the given MIME 
types). So if ManifoldCF calls the Tika server during a crawling phase, 
we cannot have different regex rules per crawling job: every job that calls the 
Tika server is processed the same way.

So here is the idea: would it be possible to make the call to a Tika server 
configurable via REST parameters, so that we could choose which config to 
use for the current call? Something like: 
?enableNER=true&NERConfig=regex1
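
To make the proposal a bit more concrete, here is a rough sketch of how a caller could build such a URL per crawling job. This is only an illustration: the enableNER and NERConfig parameters are the ones proposed above and do not exist in Tika today, and the job names, the mapping, the build_tika_url helper and the base URL are all made up for the example.

```python
from urllib.parse import urlencode

# Hypothetical mapping from a crawling job name to a named regex config.
# Neither the job names nor the config names exist anywhere; they are
# placeholders for this sketch.
JOB_NER_CONFIGS = {
    "hr-documents": "regex1",
    "invoices": "regex2",
}


def build_tika_url(base_url, job_name):
    """Return the URL a job would call, with the proposed parameters added.

    Jobs without an entry in JOB_NER_CONFIGS get the plain URL, i.e. the
    server's single default configuration, as is the case today.
    """
    config = JOB_NER_CONFIGS.get(job_name)
    if config is None:
        return base_url
    # enableNER / NERConfig are the *proposed* parameters, not an existing API.
    query = urlencode({"enableNER": "true", "NERConfig": config})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # The base URL is an assumption about where the Tika server is listening.
    print(build_tika_url("http://localhost:9998/tika", "hr-documents"))
    print(build_tika_url("http://localhost:9998/tika", "unknown-job"))
```

With something like this, each crawling job would only need to carry the name of its config, and the server would pick the matching regex rules for that call.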

Regards,

Cédric
CEO
France Labs - Your knowledge, now
Datafari Enterprise Search
