[
https://issues.apache.org/jira/browse/TIKA-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dave Meikle updated TIKA-1477:
------------------------------
Description:
The _TesseractOCRParser_ and _PDFParser_ provide different configuration
options via their dedicated config classes (_TesseractOCRConfig_ and
_PDFParserConfig_). These settings can be configured by creating an
instance of the config class and setting it on the _ParseContext_ used during parsing.
Whilst these can be set globally in configuration files via the classpath, it
would also be good to allow these to be overridden for individual requests
using custom HTTP Headers.
It is proposed that the header names take the following form:
* X-Tika-OCR<Property Name> for _TesseractOCRConfig_
* X-Tika-PDF<Property Name> for _PDFParserConfig_
For example, to set the language for the OCR parser you could send:
{noformat}
curl -T /path/to/somefile.pdf http://localhost:9998/tika --header "X-Tika-OCRLanguage: fra"
{noformat}
Or to ask the PDF Parser to extract inline images you could send:
{noformat}
curl -T /path/to/somefile.pdf http://localhost:9998/tika --header "X-Tika-PDFExtractInlineImages: true"
{noformat}
Setting a property that does not exist would raise an HTTP 500 error.
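As a rough illustration of the header-to-property mapping proposed above, the following is a minimal sketch only, not the actual Tika Server implementation. It assumes the part of the header name after the prefix is matched against a setter on the config object via reflection. The class _HeaderOverrideSketch_, its _applyHeaders_ method and the nested _OcrConfig_ class are hypothetical stand-ins used to keep the example self-contained; in the server, the populated config would instead be a _TesseractOCRConfig_ or _PDFParserConfig_ set on the _ParseContext_.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderOverrideSketch {

    /** Hypothetical stand-in for a config class such as TesseractOCRConfig. */
    public static class OcrConfig {
        private String language = "eng";
        private int timeout = 120;

        public String getLanguage() { return language; }
        public void setLanguage(String language) { this.language = language; }
        public int getTimeout() { return timeout; }
        public void setTimeout(int timeout) { this.timeout = timeout; }
    }

    /** Applies every header that starts with the prefix to a matching setter on config. */
    public static void applyHeaders(Map<String, String> headers, String prefix, Object config) {
        for (Map.Entry<String, String> header : headers.entrySet()) {
            String name = header.getKey();
            // HTTP header names are case-insensitive, so compare the prefix that way.
            if (!name.regionMatches(true, 0, prefix, 0, prefix.length())) {
                continue; // not an override header for this config
            }
            String property = name.substring(prefix.length());
            Method setter = findSetter(config.getClass(), "set" + property);
            if (setter == null) {
                // Assumption: the server layer would translate this into an HTTP 500.
                throw new IllegalArgumentException("No such property for header: " + name);
            }
            Object value = convert(header.getValue().trim(), setter.getParameterTypes()[0]);
            try {
                setter.invoke(config, value);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not apply header: " + name, e);
            }
        }
    }

    /** Finds a public single-argument method by name, ignoring case. */
    private static Method findSetter(Class<?> type, String setterName) {
        for (Method m : type.getMethods()) {
            if (m.getName().equalsIgnoreCase(setterName) && m.getParameterCount() == 1) {
                return m;
            }
        }
        return null;
    }

    /** Converts the raw header value to the setter's parameter type. */
    private static Object convert(String raw, Class<?> type) {
        if (type == String.class) return raw;
        if (type == int.class || type == Integer.class) return Integer.parseInt(raw);
        if (type == boolean.class || type == Boolean.class) return Boolean.parseBoolean(raw);
        throw new IllegalArgumentException("Unsupported property type: " + type.getName());
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-Tika-OCRLanguage", "fra");
        headers.put("Content-Type", "application/pdf"); // ignored: no matching prefix

        OcrConfig config = new OcrConfig();
        applyHeaders(headers, "X-Tika-OCR", config);
        System.out.println("language = " + config.getLanguage());
    }
}
```

Using reflection here means new config properties would be picked up without extra mapping code, at the cost of having to reject unknown or unsupported properties explicitly.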
was:
The _TesseractOCRParser_ relies on different language models to accurately OCR
content written in different languages. At present, the Tika Server provides
no way to specify additional specific languages without code changes.
To enable clients to ask for processing to be performed using specific language
models, we should add an optional new custom HTTP header (e.g.
X-Tika-OCRLanguage) which will override the TesseractOCRConfig language value
and set it on the ParseContext for use during parsing.
> Add custom header processing to allow overriding of OCR and PDF configuration
> to be used in Tika Server
> -------------------------------------------------------------------------------------------------------
>
> Key: TIKA-1477
> URL: https://issues.apache.org/jira/browse/TIKA-1477
> Project: Tika
> Issue Type: Bug
> Components: server
> Reporter: Dave Meikle
> Assignee: Dave Meikle
> Priority: Minor
> Fix For: 1.7
>