[ https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906700#comment-17906700 ]
Andreas Lehmkühler commented on PDFBOX-5902:
--------------------------------------------

Similar to the cache for CMap mappings, I've implemented a kind of cache for the Integer and byte[] values which are heavily used for CMap mappings. The given PDF produces several million such objects during text extraction, and many of them are duplicates. My latest optimization reduces the number of those instances dramatically, which saves a lot of memory and, more importantly, reduces the work for the garbage collector.

Those changes are most effective for PDFs containing hundreds of pages, where every page has its own fonts using similar toUnicode mappings. For simple PDFs the impact is most likely small or nonexistent.

> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5902
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5902
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: ltzzZ
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: image-2024-11-15-17-07-17-802.png,
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png,
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png,
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a PDF file with a size of 85.6 MB,
> the CPU usage is abnormal and reaches the alarm threshold. Extraction is also
> very slow: I got no results after a few minutes. It is not a memory problem.
> I also tried upgrading the library version, but the problem still exists.
> !image-2024-11-15-17-07-17-802.png!
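
To illustrate the deduplication idea from the comment above, here is a minimal sketch. It is not the code committed for PDFBOX-5902: the class ValueInterner and its methods internInt and internBytes are hypothetical names, and the map-based approach is only one possible way to share equal Integer and byte[] instances.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of PDFBox): returns one shared instance
 * per distinct value, so equal values parsed from many CMaps do not each
 * keep their own long-lived object.
 */
final class ValueInterner {
    private static final Map<Integer, Integer> INTS = new ConcurrentHashMap<>();
    // ByteBuffer compares by content, which makes it usable as a key for byte[] data.
    private static final Map<ByteBuffer, byte[]> BYTES = new ConcurrentHashMap<>();

    private ValueInterner() {
    }

    /** Returns the canonical Integer for the given value. */
    static Integer internInt(int value) {
        // The temporary box created for the lookup is short-lived;
        // only the first instance per value is retained.
        return INTS.computeIfAbsent(value, k -> k);
    }

    /** Returns the canonical array with this content; callers must treat it as read-only. */
    static byte[] internBytes(byte[] value) {
        byte[] cached = BYTES.get(ByteBuffer.wrap(value));
        if (cached != null) {
            return cached;
        }
        // Store a private copy so later changes to the caller's array cannot corrupt the key.
        byte[] copy = value.clone();
        byte[] previous = BYTES.putIfAbsent(ByteBuffer.wrap(copy), copy);
        return previous != null ? previous : copy;
    }

    public static void main(String[] args) {
        Integer a = internInt(40000);
        Integer b = internInt(40000);
        System.out.println(a == b); // prints "true": both refer to the same instance

        byte[] x = internBytes(new byte[] {0x12, 0x34});
        byte[] y = internBytes(new byte[] {0x12, 0x34});
        System.out.println(x == y); // prints "true"
    }
}
```

A cache like this trades a lookup per value for far fewer retained objects, which matters most when many fonts carry near-identical mappings. The maps in this sketch are unbounded, so a real implementation would have to decide how long the cached values should live.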