[jira] [Updated] (TIKA-1477) Add custom header processing to allow overriding of OCR and PDF configuration to be used in Tika Server

Dave Meikle (JIRA) Thu, 20 Nov 2014 02:14:52 -0800

     [ https://issues.apache.org/jira/browse/TIKA-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dave Meikle updated TIKA-1477:
------------------------------
    Description: 
The _TesseractOCRParser_ and _PDFParser_ each expose configuration options 
through a dedicated config class (_TesseractOCRConfig_ and _PDFParserConfig_ 
respectively). These settings can be applied by creating an instance of the 
config class, setting the desired values on it, and placing it on the 
_ParseContext_ used during parsing.

Whilst these can be set globally in configuration files via the classpath, it 
would also be good to allow these to be overridden for individual requests 
using custom HTTP Headers.

The proposed header names take the following form:
 * X-Tika-OCR<Property Name> for _TesseractOCRConfig_
 * X-Tika-PDF<Property Name> for _PDFParserConfig_

For example, to set the language for the OCR parser you could send:
{noformat}
curl -T /path/to/somefile.pdf http://localhost:9998/tika --header "X-Tika-OCRLanguage: fra"
{noformat}

Or to ask the PDF Parser to extract inline images you could send:
{noformat}
curl -T /path/to/somefile.pdf http://localhost:9998/tika --header "X-Tika-PDFExtractInlineImages: true"
{noformat}

Setting a property that does not exist would raise an HTTP 500 error.
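
The mapping from header name to config property described above could be 
sketched roughly as follows. This is only an illustration under stated 
assumptions, not the actual Tika Server code: _OcrConfig_ and _PdfConfig_ are 
hypothetical stand-ins for _TesseractOCRConfig_ and _PDFParserConfig_, and 
_applyHeaders_ is an invented helper name. It assumes the text after the 
prefix names a setter (X-Tika-OCRLanguage -> setLanguage) and that an unknown 
property surfaces as an exception, which the server would then report as an 
HTTP 500. The sketch matches header names case-sensitively for simplicity, 
whereas real HTTP header names are case-insensitive.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

final class HeaderOverrideSketch {

    // Hypothetical stand-in for TesseractOCRConfig (only one property shown).
    static final class OcrConfig {
        private String language = "eng";
        public String getLanguage() { return language; }
        public void setLanguage(String language) { this.language = language; }
    }

    // Hypothetical stand-in for PDFParserConfig (only one property shown).
    static final class PdfConfig {
        private boolean extractInlineImages = false;
        public boolean getExtractInlineImages() { return extractInlineImages; }
        public void setExtractInlineImages(boolean value) { this.extractInlineImages = value; }
    }

    // Applies every header starting with the given prefix to the config object.
    // Throws IllegalArgumentException when no matching setter exists.
    static void applyHeaders(Map<String, String> headers, String prefix, Object config) {
        for (Map.Entry<String, String> header : headers.entrySet()) {
            String name = header.getKey();
            if (!name.startsWith(prefix)) {
                continue; // header belongs to another parser or is unrelated
            }
            String property = name.substring(prefix.length());
            if (!trySet(config, "set" + property, header.getValue().trim())) {
                throw new IllegalArgumentException("No such property: " + name);
            }
        }
    }

    // Looks for a public one-argument setter and converts the header value
    // to the parameter type (String, boolean and int only in this sketch).
    private static boolean trySet(Object config, String setter, String value) {
        for (Method method : config.getClass().getMethods()) {
            if (!method.getName().equals(setter) || method.getParameterCount() != 1) {
                continue;
            }
            Class<?> type = method.getParameterTypes()[0];
            try {
                if (type == String.class) {
                    method.invoke(config, value);
                } else if (type == boolean.class) {
                    method.invoke(config, Boolean.parseBoolean(value));
                } else if (type == int.class) {
                    method.invoke(config, Integer.parseInt(value));
                } else {
                    continue; // unsupported parameter type in this sketch
                }
                return true;
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not call " + setter, e);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-Tika-OCRLanguage", "fra");
        headers.put("X-Tika-PDFExtractInlineImages", "true");

        OcrConfig ocr = new OcrConfig();
        PdfConfig pdf = new PdfConfig();
        applyHeaders(headers, "X-Tika-OCR", ocr);
        applyHeaders(headers, "X-Tika-PDF", pdf);

        System.out.println(ocr.getLanguage());
        System.out.println(pdf.getExtractInlineImages());
    }
}
```

In a real implementation the populated config objects would then be set on 
the _ParseContext_ for the request, leaving the global configuration 
untouched.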

  was:
The _TesseractOCRParser_ relies on different language models to accurately OCR 
content written in different languages.  At present, the Tika Server provides 
no way to specify additional specific languages without code changes.

To enable clients to ask for processing to be performed using specific language 
models, we should add an optional new custom HTTP header (e.g. 
X-Tika-OCRLanguage) which will override the TesseractOCRConfig language value 
and set it on the ParseContext for use during parsing.


> Add custom header processing to allow overriding of OCR and PDF configuration 
> to be used in Tika Server
> -------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1477
>                 URL: https://issues.apache.org/jira/browse/TIKA-1477
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>            Reporter: Dave Meikle
>            Assignee: Dave Meikle
>            Priority: Minor
>             Fix For: 1.7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
