Germán Cáseres created SOLR-7916:
------------------------------------
Summary: ExtractingDocumentLoader does not initialize context with
Parser.class key and DelegatingParser needs that key.
Key: SOLR-7916
URL: https://issues.apache.org/jira/browse/SOLR-7916
Project: Solr
Issue Type: Bug
Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 5.1
Reporter: Germán Cáseres
Tika's PDFParser works perfectly with Solr, except when you need to extract
metadata from images embedded in a PDF.
When PDFParser finds an embedded image, it tries to run a DelegatingParser
over that image. The problem is that DelegatingParser expects the ParseContext
to contain an entry under the Parser.class key.
If that key is not present, it falls back to EmptyParser and the inline image
metadata is not extracted.
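To make the fallback easier to see, here is a minimal, self-contained sketch of the lookup as I understand it. It does not use the Tika classes: Parser, EmptyParser, ImageParser, Context and parseEmbedded are simplified stand-ins written only for this illustration, so the real Tika signatures and behaviour may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class DelegationSketch {

    /** Hypothetical stand-in for a parser: turns a resource name into extracted text. */
    interface Parser {
        String parse(String resource);
    }

    /** Stand-in for the empty fallback parser: extracts nothing. */
    static final class EmptyParser implements Parser {
        @Override
        public String parse(String resource) {
            return "";
        }
    }

    /** Stand-in for a real parser that can handle the embedded image. */
    static final class ImageParser implements Parser {
        @Override
        public String parse(String resource) {
            return "text from " + resource;
        }
    }

    /** Simplified parse context: a map of objects keyed by class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Delegates to the parser stored in the context, or to the empty parser if none is set. */
    static String parseEmbedded(String resource, Context context) {
        Parser delegate = context.get(Parser.class, new EmptyParser());
        return delegate.parse(resource);
    }

    public static void main(String[] args) {
        Context context = new Context();

        // No Parser.class entry: the empty parser is used and nothing is extracted.
        System.out.println("[" + parseEmbedded("image.png", context) + "]");

        // With a Parser.class entry: the embedded image is actually parsed.
        context.set(Parser.class, new ImageParser());
        System.out.println("[" + parseEmbedded("image.png", context) + "]");
    }
}
```

In this model the first call prints an empty pair of brackets and the second prints the extracted text, which matches the difference I see between Solr and standalone Tika.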
I tried to extract metadata using standalone Tika with Tesseract OCR and it
works fine (the text from the PDF and from the OCRed inline images is extracted),
but when I do the same from Solr, only the text from the PDF is extracted.
I have configured PDFParser.properties with "extractInlineImages true".
Also, I tried overriding the PDFParser with a custom one and adding the
following line:
{code}
context.set(Parser.class, new AutoDetectParser());
{code}
That worked, but I don't think it is correct to modify the Tika PDFParser,
since it works fine when run without Solr.
Maybe the context should be initialized properly in the Solr class
ExtractingDocumentLoader.
Sorry for my bad English. I hope this information is useful; please tell me
if I'm doing something wrong.