[ 
https://issues.apache.org/jira/browse/SOLR-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Pugh resolved SOLR-7916.
-----------------------------
    Resolution: Won't Fix

In Solr 10 we are leveraging either Tika Server (running in its own separate 
server process) or possibly Tika Pipes (again, running in a separate JVM). 
Please revalidate your issue against Solr 10 with one of those options. If 
the need is still present, I am happy to work with you on a fix using the new 
approach for Tika.

> ExtractingDocumentLoader does not initialize context with Parser.class key 
> and DelegatingParser needs that key.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7916
>                 URL: https://issues.apache.org/jira/browse/SOLR-7916
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 5.1
>            Reporter: Germán Cáseres
>            Priority: Major
>
> Tika PDFParser works perfectly with Solr except when you need to extract 
> metadata from images embedded in a PDF.
> When PDFParser finds an embedded image, it tries to run a 
> DelegatingParser over that image. The problem is that DelegatingParser 
> expects the ParseContext to have a Parser.class key.
> If that key is not present, it falls back to EmptyParser and the inline image 
> metadata is not extracted.
> I tried extracting metadata using standalone Tika and Tesseract OCR and it 
> works fine (the text from the PDF and from the OCRed inline images is 
> extracted), but when I do the same from Solr, only the text from the PDF is 
> extracted.
> I've configured PDFParser.properties with "extractInlineImages true".
> Also, I tried overriding the PDFParser with a custom one and adding the 
> following line:
> {code}
> context.set(Parser.class, new AutoDetectParser());
> {code}
> And it worked, but I don't think it is correct to modify the Tika PDFParser 
> when it works fine outside of Solr.
> Maybe the context should be initialized properly in the Solr class 
> ExtractingDocumentLoader.
> Sorry for my bad English. I hope this information is useful; please tell me 
> if I'm doing something wrong.
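
For readers following the description above, here is a minimal, self-contained
sketch of the fallback behaviour the reporter describes. It does not use Tika
itself: Parser, Context, EMPTY, DETECTING and parseEmbedded are hypothetical
stand-ins invented for illustration, and the lookup is only assumed to mirror
how a delegating parser reads the Parser.class key from the parse context.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseContextSketch {
    /** Hypothetical stand-in for a parser: turns an embedded resource into text. */
    interface Parser {
        String parse(String embedded);
    }

    /** Hypothetical stand-in for a parse context: a map keyed by class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Plays the role of the empty fallback: extracts nothing. */
    static final Parser EMPTY = embedded -> "";

    /** Plays the role of a real parser registered by the caller. */
    static final Parser DETECTING = embedded -> "text from " + embedded;

    /** Delegates to whatever parser the context holds, else to EMPTY. */
    static String parseEmbedded(Context context, String embedded) {
        Parser delegate = context.get(Parser.class, EMPTY);
        return delegate.parse(embedded);
    }

    public static void main(String[] args) {
        // Context without the Parser.class key: embedded content is dropped.
        Context bare = new Context();
        System.out.println("[" + parseEmbedded(bare, "inline-image") + "]");

        // Context with the key set by the caller: embedded content is parsed.
        Context configured = new Context();
        configured.set(Parser.class, DETECTING);
        System.out.println("[" + parseEmbedded(configured, "inline-image") + "]");
    }
}
```

In this sketch the only difference between the two calls is whether the caller
registered a parser under the Parser.class key before parsing, which is the
initialization step the reporter suggests belongs in the calling code rather
than in PDFParser.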



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
