[
https://issues.apache.org/jira/browse/PDFBOX-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899611#comment-17899611
]
Axel Howind commented on PDFBOX-5902:
-------------------------------------
[~chain] It seems that your document produces a lot of dynamically created
strings with the same content. Without string deduplication, each of these
copies is held in memory. When string deduplication is enabled, the JVM scans
string instances in the background and makes equal strings share a single
backing array, so the duplicate character data can be garbage collected. This
is possible because strings are immutable. It reduces memory consumption.
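
To make the effect concrete, here is a minimal Java sketch. It assumes the
option meant here is HotSpot's -XX:+UseStringDeduplication, which is not
stated explicitly in this thread. The class and method names are hypothetical
and not part of PDFBox. The pool in the sketch is an application-level
substitute, not what the JVM does internally: the JVM feature works inside the
garbage collector and leaves the String objects themselves distinct.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; not part of PDFBox.
// Assumption: the JVM feature discussed above would be switched on with
//   java -XX:+UseStringDeduplication ...
// This class only illustrates why duplicates exist and what sharing saves.
final class StringDedupSketch {

    // Creates a new String object on every call, as a parser decoding
    // the same text run many times would.
    static String decode(char[] chars) {
        return new String(chars);
    }

    // Application-level pool: one canonical instance per distinct content.
    private final Map<String, String> pool = new HashMap<>();

    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int poolSize() {
        return pool.size();
    }

    public static void main(String[] args) {
        char[] hanzi = {'\u4e2d', '\u6587'}; // two Hanzi characters
        String a = decode(hanzi);
        String b = decode(hanzi);

        System.out.println(a.equals(b)); // true: same content
        System.out.println(a == b);      // false: two separate objects

        StringDedupSketch sketch = new StringDedupSketch();
        // true: both lookups return the single pooled instance
        System.out.println(sketch.canonical(a) == sketch.canonical(b));
    }
}
```

Without some form of sharing, every call to decode() in this sketch keeps its
own copy of the character data alive for as long as the string is referenced.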

I think the reason this makes such a huge difference in your case is that so
many different Hanzi characters are used in the document. In Western languages,
you would usually have well under 100 different characters (uppercase and
lowercase letters, digits, some punctuation) in a document. With CJK
(Chinese, Japanese, Korean) languages, you will have hundreds or thousands of
different characters in a document.
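
One way to check this for a given text is to count its distinct code points.
The following is a small sketch with a hypothetical class name. It assumes the
text has already been extracted into a String.

```java
// Hypothetical helper; not part of PDFBox.
final class DistinctCharCount {

    // Counts distinct Unicode code points, so characters outside the
    // Basic Multilingual Plane are counted once, not as two chars.
    static long distinctCodePoints(String text) {
        return text.codePoints().distinct().count();
    }

    public static void main(String[] args) {
        System.out.println(distinctCodePoints("hello world"));              // 8
        System.out.println(distinctCodePoints("\u4f60\u597d\u4e16\u754c")); // 4
    }
}
```

Run over the full extracted text of a document, this number gives a rough idea
of how many different single-character strings a text extractor may end up
creating.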

As a side note, I am still totally awed by Chinese and Japanese children who
manage to learn several characters every day, so that they know several thousand
different characters when they finish school. I once tried to learn Chinese,
and even after months I still needed help from my teacher when reading the
simplest text from a children's book.

Just as reading Chinese texts is so different for us humans, it poses different
challenges for software that processes Chinese texts: the sheer number of
characters, finding word boundaries, unusual punctuation, and many more. Since
most software is tested mainly with languages based on some descendant of the
Latin alphabet, and with the assumptions that we as Western people make about
the structure of texts (like words being separated by whitespace), issues will
often come up when the software is used with a language where those assumptions
do not hold. So it's even more important to have people using CJK languages
test and report issues.
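
The whitespace assumption is easy to demonstrate. This is a minimal sketch with
a hypothetical class name and made-up sample sentences. It is not taken from
PDFBox.

```java
// Hypothetical demo; shows that whitespace tokenization finds no
// word boundaries in a Chinese sentence written without spaces.
final class WhitespaceSplitSketch {

    // Naive tokenizer: assumes words are separated by whitespace.
    static String[] naiveWords(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String english = "I like reading books";
        String chinese = "\u6211\u559c\u6b22\u8bfb\u4e66"; // same idea, no spaces

        System.out.println(naiveWords(english).length); // 4
        System.out.println(naiveWords(chinese).length); // 1
    }
}
```

The Chinese sentence comes back as a single token, so any logic that relies on
such tokens being words silently behaves differently.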
> The CPU usage of a PDF file with a size of 85.6 MB is abnormal
> --------------------------------------------------------------
>
> Key: PDFBOX-5902
> URL: https://issues.apache.org/jira/browse/PDFBOX-5902
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.31, 3.0.2 PDFBox
> Reporter: ltzzZ
> Priority: Major
> Attachments: image-2024-11-15-17-07-17-802.png,
> image-2024-11-16-12-23-59-684.png, image-2024-11-16-12-38-54-861.png,
> image-2024-11-19-08-50-37-171.png, image-2024-11-19-08-55-59-315.png,
> image-2024-11-19-08-56-23-894.png, image-2024-11-19-08-56-49-755.png
>
>
> When I try to extract the text content from a PDF file with a size of 85.6 MB,
> CPU usage becomes abnormal and reaches the alarm threshold. Extraction is
> also very slow: I got no results after a few minutes. This is not a memory
> problem. I also tried upgrading the library version, but the problem still
> exists.
> !image-2024-11-15-17-07-17-802.png!