[ https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899860#comment-17899860 ]
Tilman Hausherr commented on PDFBOX-5902:
-----------------------------------------

So the problem is in
{code:java}
private static String createStringFromBytes(byte[] bytes)
{
    // a single byte is decoded as ISO-8859-1, anything longer as UTF-16BE
    return new String(bytes,
            bytes.length == 1 ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_16BE);
}
{code}
But what can we do? If the /ToUnicode stream is big, then many strings will be created.

Regarding regression tests: before every release we run a text extraction regression test over 250000 files. Obviously we don't look at every file, but there is a report that shows the differences. Then there's the rendering regression test, which I run myself (sometimes several times per week).

I'm not sure whether we have a CJK text extraction test in the source code download. For that we'd need a file that isn't copyrighted, e.g. a government PDF, or literature from somebody who died long ago.

> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5902
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5902
>             Project: PDFBox
>          Issue Type: Bug
>   Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: ltzzZ
>            Priority: Major
>         Attachments: image-2024-11-15-17-07-17-802.png,
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png,
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png,
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a PDF file with a size of 85.6 MB,
> the CPU usage becomes abnormal and reaches the alarm threshold. The extraction
> is also very slow: I didn't get results after a few minutes. It is not a memory
> problem. I also tried upgrading the version of the library, but the problem
> still exists.
> !image-2024-11-15-17-07-17-802.png!
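
One possible direction for the "what can we do?" question about createStringFromBytes in the comment above, offered only as a sketch and not as the project's actual fix: since ISO-8859-1 maps each byte value 0..255 directly to the code point with the same number, the single-byte case could return a pre-built string from a 256-entry table instead of allocating a new one each time. The class name ByteStringCache and the table SINGLE_BYTE are hypothetical names made up for this illustration; whether this helps with the file from the report has not been measured here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: illustrates reusing strings
// for the single-byte case of createStringFromBytes.
final class ByteStringCache
{
    // One pre-built String per byte value; ISO-8859-1 maps byte b to U+00xx.
    private static final String[] SINGLE_BYTE = new String[256];

    static
    {
        for (int i = 0; i < SINGLE_BYTE.length; i++)
        {
            SINGLE_BYTE[i] = String.valueOf((char) i);
        }
    }

    private ByteStringCache()
    {
    }

    static String createStringFromBytes(byte[] bytes)
    {
        if (bytes.length == 1)
        {
            // mask to 0..255 because Java bytes are signed
            return SINGLE_BYTE[bytes[0] & 0xFF];
        }
        // same decoding as the quoted method for longer sequences
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args)
    {
        System.out.println(createStringFromBytes(new byte[] { 0x41 }));
        System.out.println(createStringFromBytes(new byte[] { 0x4E, 0x2D }));
    }
}
```

This only removes allocations for one-byte values. Multi-byte sequences, which dominate in CJK /ToUnicode streams, still create a new string per call, so the sketch would not address that part of the cost.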