Jonathan Prates created PDFBOX-5823:
---------------------------------------
Summary: StringUtil.PATTERN_SPACE memory optimisation
Key: PDFBOX-5823
URL: https://issues.apache.org/jira/browse/PDFBOX-5823
Project: PDFBox
Issue Type: Improvement
Components: PDModel
Affects Versions: 3.0.3 PDFBox
Reporter: Jonathan Prates
Attachments: Screenshot 2024-05-19 at 22.39.10.png, Screenshot
2024-05-19 at 22.40.17.png
PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check
whether a word contains whitespace
([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624])
For large documents (~800 pages) made up of short strings (such as regular
words), this causes memory overhead (see the attached screenshots) because of
the many extra allocations. I've replaced the regexp with word.contains checks
for space and \t; since contains is an O(n) operation that requires no extra
allocations, memory use has been reduced.
What would be the implications of replacing this block with contains()?
Since \s is [ \t\n\x0B\f\r], I believe a simplified check over those
characters could be used instead, allocating less memory.
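
To make the trade-off concrete, here is a minimal sketch of the three checks
being compared. It is not the PDFBox code: the class and method names are
hypothetical, and the local PATTERN_SPACE is assumed to be equivalent to
Pattern.compile("\\s"), based on the \s definition quoted above.

```java
import java.util.regex.Pattern;

class WhitespaceCheck {
    // Assumption: stand-in for StringUtil.PATTERN_SPACE, taken to be "\\s".
    static final Pattern PATTERN_SPACE = Pattern.compile("\\s");

    // Regex-based check: each call creates a new Matcher object.
    static boolean containsWhitespaceRegex(String word) {
        return PATTERN_SPACE.matcher(word).find();
    }

    // Simplified check for space and tab only, with no per-call allocation.
    static boolean containsSpaceOrTab(String word) {
        return word.indexOf(' ') >= 0 || word.indexOf('\t') >= 0;
    }

    // Allocation-free scan over the full default \s set: [ \t\n\x0B\f\r].
    static boolean containsWhitespaceScan(String word) {
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n'
                    || c == 0x0B || c == '\f' || c == '\r') {
                return true;
            }
        }
        return false;
    }
}
```

The space-and-tab variant differs from the regexp for words containing \n,
\x0B, \f or \r, so it is only a drop-in replacement if those characters cannot
reach this code path. The scan variant keeps the same character set as \s
while still avoiding the Matcher allocation.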