[ 
https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899860#comment-17899860
 ] 

Tilman Hausherr edited comment on PDFBOX-5902 at 11/20/24 8:01 PM:
-------------------------------------------------------------------

So the problem is in
{code:java}
    private static String createStringFromBytes(byte[] bytes)
    {
        return new String(bytes,
                bytes.length == 1 ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_16BE);
    }
{code}
But what can we do? If the /ToUnicode stream is big, or if there are many such 
streams, then many strings will be created.
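
To make the behaviour concrete, here is a minimal standalone sketch of the method quoted above. The class name ToUnicodeDecodeSketch and the sample byte values are assumptions made for illustration and are not taken from the PDFBox source. It shows the two decoding paths, and that every call allocates a fresh String even for identical input, which is where the cost adds up when a /ToUnicode stream has many entries.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
public class ToUnicodeDecodeSketch
{
    // Same logic as the quoted method: a single byte is decoded as
    // ISO-8859-1, anything longer as UTF-16BE.
    static String createStringFromBytes(byte[] bytes)
    {
        return new String(bytes,
                bytes.length == 1 ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args)
    {
        // One byte 0x41 takes the ISO-8859-1 path and yields "A".
        String latin = createStringFromBytes(new byte[] { 0x41 });
        // Two bytes 0x4E 0x2D take the UTF-16BE path and yield U+4E2D.
        String cjk = createStringFromBytes(new byte[] { 0x4E, 0x2D });

        System.out.println(latin.equals("A"));      // prints true
        System.out.println(cjk.equals("\u4E2D"));   // prints true

        // Decoding the same bytes again produces an equal but distinct object.
        String again = createStringFromBytes(new byte[] { 0x4E, 0x2D });
        System.out.println(cjk == again);           // prints false
    }
}
```

Whether these allocations are really the dominant cost for the reported file would still need to be confirmed with a profiler.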

Re regression tests: we run a text extraction regression test on 250000 files 
before every release. Obviously we don't look at every file, but there is a 
report that shows the differences. Then there's the rendering regression test, 
which I run (sometimes several times per week). I'm not sure whether we have a 
CJK text extraction test in the source code download. For that we'd need a file 
that isn't copyrighted, e.g. a government PDF, or literature from somebody who 
died long ago.


> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5902
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5902
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: ltzzZ
>            Priority: Major
>         Attachments: image-2024-11-15-17-07-17-802.png, 
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png, 
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png, 
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a PDF file with a size of 85.6 MB, 
> the CPU usage becomes abnormal and reaches the alarm threshold. Extraction is 
> also very slow: I didn't get results after a few minutes. It is not a memory 
> problem. I also tried upgrading the library version, but the problem still 
> exists.
> !image-2024-11-15-17-07-17-802.png!


