[
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512728#comment-17512728
]
Luís Filipe Nassif commented on TIKA-3571:
------------------------------------------
If we are going to add a PDF renderer implementation based on PDFBox, running
it on a JRE >= 11 (or at least avoiding Oracle JRE 1.8) should make a huge
performance difference, at least on Windows:
[https://github.com/sepinf-inc/IPED/issues/119#issuecomment-822670734]
I wonder if [~tilman] knows anything about Java 8 low-level font bottlenecks...
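As a rough illustration of the point above, a renderer could warn when it is
started on a pre-11 runtime. This is only a sketch: the class and method names
(JavaRuntimeCheck, majorVersion, isLegacyRuntime) are hypothetical and not part
of Tika or PDFBox, and the threshold of 11 is taken from the comment above. It
relies only on the standard java.specification.version and java.vendor system
properties.

```java
public final class JavaRuntimeCheck {
    private JavaRuntimeCheck() {}

    /**
     * Parses the major version from a java.specification.version value.
     * Java 8 and earlier report "1.8", "1.7", ...; Java 9+ report "9", "11", "17", ...
     */
    static int majorVersion(String specVersion) {
        String v = specVersion.trim();
        if (v.startsWith("1.")) {
            v = v.substring(2); // "1.8" -> "8"
        }
        int dot = v.indexOf('.');
        if (dot >= 0) {
            v = v.substring(0, dot); // defensive: drop any minor component
        }
        return Integer.parseInt(v);
    }

    /** True when the runtime is older than Java 11 (threshold assumed from the comment). */
    static boolean isLegacyRuntime(String specVersion) {
        return majorVersion(specVersion) < 11;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        String vendor = System.getProperty("java.vendor");
        if (isLegacyRuntime(spec)) {
            System.err.println("Warning: running on Java " + spec + " (" + vendor
                    + "); PDF rendering may be noticeably slower than on Java 11+.");
        } else {
            System.out.println("Java " + spec + " (" + vendor + ") detected.");
        }
    }
}
```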
For office files, we currently use LibreOffice from the command line.
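For anyone unfamiliar with that approach, a minimal sketch of calling
LibreOffice from Java might look like the following. It is not the exact
invocation we use: the class name LibreOfficeCommand, the helper
buildConvertToPdf, the choice of PDF as the target format, and the assumption
that a "soffice" binary is on the PATH are all illustrative. The flags
themselves (--headless, --convert-to, --outdir) are standard soffice options.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public final class LibreOfficeCommand {
    private LibreOfficeCommand() {}

    /** Builds the argument list for a headless conversion of one file to PDF. */
    static List<String> buildConvertToPdf(String sofficeBinary, Path input, Path outDir) {
        return Arrays.asList(
                sofficeBinary,
                "--headless",          // no GUI
                "--convert-to", "pdf", // target format (assumed for this sketch)
                "--outdir", outDir.toString(),
                input.toString());
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: LibreOfficeCommand <input-file> <output-dir>");
            return;
        }
        List<String> cmd = buildConvertToPdf("soffice", Paths.get(args[0]), Paths.get(args[1]));
        // Run the converter and wait for it; inheritIO forwards its output to this console.
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        int exitCode = process.waitFor();
        System.out.println("soffice exited with code " + exitCode);
    }
}
```

The resulting PDF could then go through the same PDF rendering path as native
PDFs, which is one reason the JRE choice above matters for office files too.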
> Add an interface for rendering engines
> --------------------------------------
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
> Issue Type: Wish
> Reporter: Tim Allison
> Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and
> certainly it might be useful to have alternatives for rendering files (e.g.
> this [Alfresco
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
> including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open
> an issue for discussion.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)