[ 
https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899611#comment-17899611
 ] 

Axel Howind commented on PDFBOX-5902:
-------------------------------------

[~chain] It seems that your document produces a lot of dynamically created 
strings with the same content. Without string deduplication, each of these 
copies is held in memory separately. When string deduplication is enabled, 
the JVM scans string instances in the background and lets strings with equal 
content share a single copy of the underlying character data. This is 
possible because strings are immutable, and it reduces memory consumption.
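
To illustrate the effect, here is a minimal sketch. It is not PDFBox code and 
not what the JVM does internally: the JVM feature is switched on with the 
HotSpot flag -XX:+UseStringDeduplication and shares the backing character 
data, whereas this sketch canonicalizes whole String objects through a map at 
application level. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class StringDedupSketch {

    /** Builds n strings with equal content but distinct identity, as a parser might. */
    public static List<String> createCopies(String content, int n) {
        List<String> copies = new ArrayList<>(n);
        char[] chars = content.toCharArray();
        for (int i = 0; i < n; i++) {
            copies.add(new String(chars)); // a new String object on every iteration
        }
        return copies;
    }

    /** Counts distinct String objects by identity (==), not by equals(). */
    public static int countInstances(List<String> strings) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String s : strings) {
            seen.put(s, Boolean.TRUE);
        }
        return seen.size();
    }

    /** Replaces every string with one canonical instance per distinct content. */
    public static List<String> canonicalize(List<String> strings) {
        Map<String, String> pool = new HashMap<>();
        List<String> result = new ArrayList<>(strings.size());
        for (String s : strings) {
            result.add(pool.computeIfAbsent(s, k -> k));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> copies = createCopies("\u6f22", 1000); // one Hanzi, 1000 copies
        System.out.println(countInstances(copies));               // prints 1000
        System.out.println(countInstances(canonicalize(copies))); // prints 1
    }
}
```

Before canonicalization there are 1000 separate objects holding the same 
character; afterwards all list entries point to one instance. The JVM flag 
gives a similar memory saving without any change to the application code.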

I think the reason this makes such a huge difference in your case is that so 
many different Hanzi characters are used in the document. In Western 
languages, you would usually have well under 100 different characters 
(uppercase and lowercase letters, digits, some punctuation) in your document. 
But with CJK (Chinese, Japanese, Korean) languages, you will have hundreds or 
thousands of different characters in your document.
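
The size of the character inventory is easy to check for a given text. The 
following is only a sketch with a made-up class name; it counts distinct 
Unicode code points in a string and says nothing about how PDFBox represents 
characters internally.

```java
public class DistinctCharsSketch {

    /** Returns the number of distinct Unicode code points in the text. */
    public static int countDistinctCodePoints(String text) {
        return (int) text.codePoints().distinct().count();
    }

    public static void main(String[] args) {
        // H, e, l, o, comma, space, w, r, d
        System.out.println(countDistinctCodePoints("Hello, world")); // prints 9
    }
}
```

Running this over the extracted text of a long English document would 
typically stay below 100, while a long Chinese document can reach several 
thousand.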

As a side note, I am still totally awed by Chinese and Japanese children who 
manage to learn several characters every day so that they know several 
thousand different characters when they finish school. I once tried to learn 
Chinese, and even after months I still needed help from my teacher when 
reading the simplest text from a children's book.

Just as reading Chinese texts is so different for us humans, it poses 
different challenges for software that processes Chinese texts: the sheer 
number of characters, finding word boundaries, unusual punctuation, and many 
more. Most software is tested mainly with languages based on some descendant 
of the Latin alphabet, and with assumptions that we as Western people make 
about the structure of texts (like words being separated by whitespace). So 
issues will often come up when the software is used with a language where 
those assumptions do not hold. That makes it even more important to have 
people using CJK languages test the software and report issues.
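
As a small illustration of the whitespace assumption, here is a sketch (the 
class name is made up, and this is not how PDFBox segments text) that splits 
a sentence on whitespace. It works for English but returns the whole Chinese 
sentence as a single "word".

```java
public class WhitespaceTokenSketch {

    /** Counts tokens under the naive assumption that whitespace separates words. */
    public static int countWhitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // English: four words, four tokens.
        System.out.println(countWhitespaceTokens("I like reading books")); // prints 4

        // Chinese sentence with the same meaning, written without spaces
        // (U+6211 U+559C U+6B22 U+8BFB U+4E66): several words, one token.
        System.out.println(countWhitespaceTokens("\u6211\u559c\u6b22\u8bfb\u4e66")); // prints 1
    }
}
```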


> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5902
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5902
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: ltzzZ
>            Priority: Major
>         Attachments: image-2024-11-15-17-07-17-802.png, 
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png, 
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png, 
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a PDF file with a size of 
> 85.6 MB, the CPU usage is abnormally high and reaches the alarm threshold. 
> The extraction is also very slow: I got no results after a few minutes. 
> This is not a memory problem. I also tried upgrading the library version, 
> but the problem still exists.
> !image-2024-11-15-17-07-17-802.png!



