[ 
https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899316#comment-17899316
 ] 

Axel Howind commented on PDFBOX-5902:
-------------------------------------

This is marked as affecting versions 2.0.31 and 3.0.2. Have you tested with 
3.0.3 yet? Several performance bugs were fixed in that version. Here is a list 
of fixes and improvements that could be related to your problem (the profiler 
output suggests it has to do with font handling and dynamic string creation):

[PDFBOX-5790] - Don't use a predefined CMap if a ToUnicode CMap is present
[PDFBOX-5799] - Page with thousands of content streams takes extremely long to 
render or extract
[PDFBOX-5809] - PDDocument#importPage slowed down by factor 1300
[PDFBOX-5845] - potential memory leak in TrueTypeCollection.java
[PDFBOX-5675] - org.apache.pdfbox.filter.Filter#decode() Java heap space
[PDFBOX-5819] - Make Type2CharStringParser thread-safe
[PDFBOX-5823] - StringUtil.PATTERN_SPACE memory optimisation
[PDFBOX-5847] - Improve performance of FileSystemFontProvider.scanFonts()

If 3.0.3 does not fix it: it looks like the problem is caused at least in part 
by dynamically creating Strings (have a look at the profiler screenshots you 
posted). If you are on Java 8, please try running with the G1 garbage 
collector, once with string deduplication enabled and once with it disabled, 
and report your results back.

Use these options to enable the G1 collector and string deduplication (the 
feature was introduced in Java 8u20):

    -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics

See https://openjdk.org/jeps/192 for details.
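Before comparing runs, it may help to confirm that the options actually reached the JVM. The following is a minimal sketch using only the standard java.lang.management API; the class name JvmFlagCheck and the helper hasFlag are made up for illustration and are not part of PDFBox.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper class (not part of PDFBox): reports which GC-related
// options the running JVM was started with.
class JvmFlagCheck {

    /** Returns true if the exact option string appears among the JVM arguments. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("G1 requested: " + hasFlag(jvmArgs, "-XX:+UseG1GC"));
        System.out.println("String deduplication requested: "
                + hasFlag(jvmArgs, "-XX:+UseStringDeduplication"));

        // Names of the collectors that are actually active in this JVM.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("active collector: " + gc.getName());
        }
    }
}
```

Note that "G1 requested" only reflects an explicit option. On Java 9 and later G1 is the default collector, so the list of active collectors is the more reliable indicator there.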

If you can, also try running with Java 11 or, even better, 17, as there have 
been massive changes to the internal String handling since Java 8. These 
changes are transparent to user code, but can have a large impact on both 
memory use and runtime. Even if you can only test locally and not yet update 
in production, the result might point in the right direction.
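To make results from different Java versions and GC settings comparable, the same workload can be timed in each configuration. This is only a sketch: ExtractionTimer and timeMillis are hypothetical names, and the workload in main is a placeholder that should be replaced with the actual text extraction call.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical timing helper (not part of PDFBox).
class ExtractionTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): substitute the real extraction here.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1_000_000; i++) {
                sb.append(i % 10);
            }
        };

        long ms = timeMillis(task);

        // Print the JVM identity with the timing so runs can be told apart.
        System.out.println(System.getProperty("java.vm.name") + " "
                + System.getProperty("java.version") + ": " + ms + " ms");
    }
}
```

A single run includes JIT warm-up, so repeating the measurement a few times per configuration gives a fairer comparison.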


> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5902
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5902
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: ltzzZ
>            Priority: Major
>         Attachments: image-2024-11-15-17-07-17-802.png, 
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png, 
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png, 
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a PDF file with a size of 85.6 MB, 
> the CPU usage is abnormal and reaches the alarm threshold. The extraction is 
> also very slow: I didn't get results for a few minutes. It is not a memory 
> problem. I also tried upgrading the version of the library, but the problem 
> still exists.
> !image-2024-11-15-17-07-17-802.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

