[ 
https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906700#comment-17906700
 ] 

Andreas Lehmkühler edited comment on PDFBOX-5902 at 12/30/24 9:45 AM:
----------------------------------------------------------------------

Similar to the cache for CMap mappings, I've implemented a kind of cache for 
the Integer and byte[] values that are heavily used in CMap mappings. The given 
PDF produces several million such objects during text extraction, and many 
of them are duplicates. My latest optimization reduces the number of instances 
dramatically, which saves a lot of memory and, more importantly, reduces the 
work for the garbage collector.

Those changes are most effective for PDFs containing hundreds of pages, where 
every page has its own fonts using similar toUnicode mappings. For simple PDFs 
the impact is most likely small or nonexistent.
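
To illustrate the idea of sharing duplicate Integer and byte[] values, here is 
a minimal sketch. It is not the code committed for this issue: the class name 
ValueCache and the methods internInt and internBytes are hypothetical, and the 
choice of ConcurrentHashMap with ByteBuffer keys is an assumption made only 
for this example.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not part of the PDFBox API.
final class ValueCache {
    // Canonical instance per distinct int value.
    private static final Map<Integer, Integer> INTEGERS = new ConcurrentHashMap<>();
    // byte[] has identity-based equals/hashCode, so the key is a ByteBuffer,
    // whose equals/hashCode compare the wrapped content.
    private static final Map<ByteBuffer, byte[]> BYTES = new ConcurrentHashMap<>();

    private ValueCache() {
    }

    // Returns one shared Integer instance per distinct value.
    static Integer internInt(int value) {
        Integer boxed = value;
        Integer existing = INTEGERS.putIfAbsent(boxed, boxed);
        return existing != null ? existing : boxed;
    }

    // Returns one shared array per distinct content.
    // Callers must treat the returned array as read-only.
    static byte[] internBytes(byte[] bytes) {
        byte[] existing = BYTES.get(ByteBuffer.wrap(bytes));
        if (existing != null) {
            return existing;
        }
        // Store a private copy so later changes to the caller's array
        // cannot corrupt the cache.
        byte[] copy = bytes.clone();
        byte[] previous = BYTES.putIfAbsent(ByteBuffer.wrap(copy), copy);
        return previous != null ? previous : copy;
    }
}
```

With such a helper, two fonts that map the same code to the same value would 
hold a reference to one shared object instead of two equal copies, so fewer 
objects stay alive for the garbage collector to track. The sketch never evicts 
entries, which is a simplification.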


> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5902
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5902
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox, Parsing
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: ltzzZ
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: image-2024-11-15-17-07-17-802.png, 
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png, 
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png, 
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a PDF file with a size of 85.6 MB, 
> the CPU usage becomes abnormal and reaches the alarm threshold. The extraction 
> is also very slow: I didn't get results for a few minutes. It is not a memory 
> problem. I also tried upgrading the library version, but the problem still 
> exists.
> !image-2024-11-15-17-07-17-802.png!


