[
https://issues.apache.org/jira/browse/SOLR-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Pugh resolved SOLR-7916.
-----------------------------
Resolution: Won't Fix
In Solr 10 we are leveraging either Tika Server (running in its own separate
server process) or maybe Tika Pipes (again, running in a separate JVM).
Please revalidate your issue against Solr 10 with one of those options, and if
it is still present, we are happy to work with you on a fix using the new
approach for Tika.
> ExtractingDocumentLoader does not initialize context with Parser.class key
> and DelegatingParser needs that key.
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-7916
> URL: https://issues.apache.org/jira/browse/SOLR-7916
> Project: Solr
> Issue Type: Bug
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 5.1
> Reporter: Germán Cáseres
> Priority: Major
>
> Tika PDFParser works perfectly with Solr except when you need to extract
> metadata from images embedded in a PDF.
> When PDFParser finds an embedded image, it tries to execute a
> DelegatingParser over that image. But the problem is that DelegatingParser
> expects ParseContext to have Parser.class key.
> If that key is not present, it falls back to EmptyParser and inline image
> metadata is not extracted.
> I tried to extract metadata using standalone Tika and Tesseract OCR and it
> works fine (the text from the PDF and from OCRed inline images is extracted)...
> but when I do the same from Solr, only the text from the PDF is extracted.
> I've properly configured PDFParser.properties with "extractInlineImages true".
> Also, I tried overriding the PDFParser with a custom one and adding the
> following line:
> {code}
> context.set(Parser.class, new AutoDetectParser());
> {code}
> And it worked... but I don't think it is correct to modify the Tika PDFParser
> when it works fine outside Solr.
> Maybe the context should be initialized properly in the Solr class
> ExtractingDocumentLoader.
> Sorry for my bad English, hope this information is useful, and please tell me
> if I'm doing something wrong.
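The fallback described in the report can be illustrated with a minimal, self-contained sketch. It does not use Tika or Solr: `Context`, `EMPTY`, `ECHO`, `DELEGATING`, and `parseEmbedded` are hypothetical stand-ins that only mimic the pattern of a class-keyed context and a delegating parser that defaults to an empty parser when no `Parser` entry has been set.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextFallbackSketch {

    /** Stand-in for a parser (hypothetical, not Tika's Parser interface). */
    interface Parser {
        String parse(String input, Context context);
    }

    /** Stand-in for a class-keyed context, loosely modeled on a parse context. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Extracts nothing; plays the role of the empty fallback parser. */
    static final Parser EMPTY = (input, context) -> "";

    /** Hypothetical "real" parser that simply returns its input. */
    static final Parser ECHO = (input, context) -> input;

    /** Looks up the Parser entry in the context and falls back to EMPTY. */
    static final Parser DELEGATING =
        (input, context) -> context.get(Parser.class, EMPTY).parse(input, context);

    /** Simulates handing an embedded resource to the delegating parser. */
    static String parseEmbedded(String embedded, Context context) {
        return DELEGATING.parse(embedded, context);
    }

    public static void main(String[] args) {
        // No Parser entry in the context: the embedded content is silently dropped.
        Context bare = new Context();
        System.out.println("[" + parseEmbedded("image text", bare) + "]");

        // Parser entry set by the caller: the embedded content is extracted.
        Context configured = new Context();
        configured.set(Parser.class, ECHO);
        System.out.println("[" + parseEmbedded("image text", configured) + "]");
    }
}
```

Under these assumptions, the first call prints `[]` and the second prints `[image text]`, which mirrors the reported behavior: nothing fails, but embedded content is lost unless the caller registers a parser in the context before parsing.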