[ 
https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906089#comment-17906089
 ] 

Andreas Lehmkühler commented on PDFBOX-5902:
--------------------------------------------

As I wrote earlier, I've implemented a cache for all common one- and two-byte 
mappings which may occur in a CMap. The given pdf contains 442 pages. All pages 
have 3 fonts (I didn't check each of them), and one of those fonts uses an 
identity map as its toUnicode mapping. So we end up with 442 mappings, each 
containing the very same 65K strings, i.e. 442 instances of each of those 65K 
strings. The optional string de-duplication mechanism of the JRE might 
eliminate them, but IMHO it is a better strategy to avoid creating those 
strings in the first place.
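
To illustrate the idea, here is a minimal sketch of such a shared cache. It is 
not the actual PDFBox code: the class and method names (SharedCodeStringCache, 
cached, identityMapping) are made up for this example, and it assumes that an 
identity toUnicode mapping simply maps each one- or two-byte code to the 
character with the same value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only, not the PDFBox implementation.
class SharedCodeStringCache {
    // One shared String per one- or two-byte code (0..65535),
    // created once during class initialization.
    private static final String[] CACHE = new String[65536];

    static {
        for (int code = 0; code < CACHE.length; code++) {
            CACHE[code] = String.valueOf((char) code);
        }
    }

    // Returns the shared String instance for the given code.
    static String cached(int code) {
        if (code < 0 || code >= CACHE.length) {
            throw new IllegalArgumentException("code out of range: " + code);
        }
        return CACHE[code];
    }

    // Builds an identity mapping (assumed shape: code -> same character)
    // which reuses the shared strings instead of allocating new ones.
    static Map<Integer, String> identityMapping(int size) {
        Map<Integer, String> mapping = new HashMap<>(size * 2);
        for (int code = 0; code < size; code++) {
            mapping.put(code, cached(code));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Two fonts with an identity mapping share the same instances.
        Map<Integer, String> first = identityMapping(65536);
        Map<Integer, String> second = identityMapping(65536);
        System.out.println(first.get(0x41) == second.get(0x41)); // prints true
    }
}
```

With something like this, 442 identity mappings would still hold 442 maps, but 
every map entry points to one of the same 65K string instances.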

The cache speeds up the text extraction of the given pdf. 

P.S.: I wasn't able to reproduce those strange effects I'd encountered 3 weeks 
ago. Maybe it was some issue with my system. Meanwhile I've installed some OS 
updates and rebooted the machine ....

> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5902
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5902
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: ltzzZ
>            Priority: Major
>         Attachments: image-2024-11-15-17-07-17-802.png, 
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png, 
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png, 
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a pdf file with a size of 85.6MB, 
> the CPU usage is abnormal and reaches the alarm threshold. The extraction is 
> also very slow: I didn't get results for a few minutes. It is not a memory 
> problem. I also tried upgrading the version of the library, but the problem 
> still exists.
> !image-2024-11-15-17-07-17-802.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
