[ https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899316#comment-17899316 ]
Axel Howind commented on PDFBOX-5902:
-------------------------------------

This is marked as affecting versions 2.0.31 and 3.0.2. Have you tested with 3.0.3 yet? Multiple performance bugs were fixed in that version. Here is a list of fixes and improvements that could be related to your problem (based on the profiler output, which points to font handling and dynamic string creation):

[PDFBOX-5790] - Don't use a predefined CMap if a ToUnicode CMap is present
[PDFBOX-5799] - Page with thousands of content streams takes extremely long to render or extract
[PDFBOX-5809] - PDDocument#importPage slowed down by factor 1300
[PDFBOX-5845] - potential memory leak in TrueTypeCollection.java
[PDFBOX-5675] - org.apache.pdfbox.filter.Filter#decode() Java heap space
[PDFBOX-5819] - Make Type2CharStringParser thread-safe
[PDFBOX-5823] - StringUtil.PATTERN_SPACE memory optmisation
[PDFBOX-5847] - Improve performance of FileSystemFontProvider.scanFonts()

If it is not fixed by 3.0.3: It looks like the problem is caused, at least in part, by dynamically creating Strings (have a look at the profiler screenshots you posted). If you are on Java 8, please try running with the G1 garbage collector, once with string deduplication enabled and once with it disabled, and report your results back. Use this to enable the G1 collector and string deduplication (the feature was introduced in Java 8u20):

{-XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics}

See https://openjdk.org/jeps/192 for details.

If you can, also try running with Java 11 or, even better, 17, as there have been massive changes to the internal String handling since Java 8. These changes are transparent to user code, but can have a significant impact on both memory and runtime. Even if you can only test locally and cannot yet update in production, the results might point in the right direction.
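When comparing runs with and without these options, it can help to log which flags actually reached the JVM, since launch scripts and containers sometimes drop or override them. The following is a minimal sketch using only the JDK; the class name JvmFlagCheck and the helper isEnabled are hypothetical names chosen for illustration and are not part of PDFBox. It only sees options passed explicitly on the command line, so a collector that is merely the JVM default (for example G1 on Java 9 and later) is reported as not explicitly enabled.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
class JvmFlagCheck {

    /**
     * Returns true if the boolean -XX flag was explicitly switched on.
     * If the flag appears several times, the last occurrence wins,
     * which mirrors how the JVM treats repeated -XX options.
     */
    static boolean isEnabled(List<String> jvmArgs, String flagName) {
        boolean enabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+" + flagName)) {
                enabled = true;
            } else if (arg.equals("-XX:-" + flagName)) {
                enabled = false;
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Options the JVM was started with (does not include defaults).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("UseG1GC explicitly enabled: "
                + isEnabled(jvmArgs, "UseG1GC"));
        System.out.println("UseStringDeduplication explicitly enabled: "
                + isEnabled(jvmArgs, "UseStringDeduplication"));
    }
}
```

Running this once at the start of each test run, next to the text extraction being timed, gives a record of the Java version and flag combination to include when reporting results back.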
> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
> Key: PDFBOX-5902
> URL: https://issues.apache.org/jira/browse/PDFBOX-5902
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.31, 3.0.2 PDFBox
> Reporter: ltzzZ
> Priority: Major
> Attachments: image-2024-11-15-17-07-17-802.png,
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png,
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png,
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a pdf file with a size of 85.6MB,
> at this point the CPU usage is abnormal, the threshold of the alarm is
> reached, and the extraction speed is also very slow, didn't get results for a
> few minutes, not a memory problem, also tried to upgrade the version of the
> library, this problem still exists.
> !image-2024-11-15-17-07-17-802.png!