[jira] [Comment Edited] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

Jonathan Prates (Jira) Mon, 20 May 2024 05:45:26 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847855#comment-17847855
 ]


Jonathan Prates edited comment on PDFBOX-5823 at 5/20/24 12:44 PM:
-------------------------------------------------------------------

Sure. What I mean is that contains() is slower for big strings, but not for 
small ones. My suggestion is to use a set, in order to avoid memory allocation 
and resolve each lookup in O(1) time.
{code:java}
var SPACES_SET = Set.of(" ", "\t", "\n", "\r", "\f", "\u000B");{code}
I've attached a simple benchmark: [^Main.java]. I can see a similar pattern of 
memory allocation for the regexp here.
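
To make the idea concrete, here is a minimal sketch of an allocation-free check. It is not the PDFBox code: the class and method names are hypothetical, and the regex is only assumed to be equivalent to \s in Java's default mode, i.e. [ \t\n\x0B\f\r]. It scans the characters directly instead of building a Matcher for every word.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not part of PDFBox).
final class WhitespaceCheck {
    // Assumption: stands in for the regex being replaced; \s in default
    // (non-Unicode) mode matches space, tab, LF, VT, FF and CR.
    static final Pattern ASSUMED_PATTERN = Pattern.compile("\\s");

    // Scans the word once; no Matcher, array or boxed object is created.
    static boolean containsWhitespace(String word) {
        for (int i = 0; i < word.length(); i++) {
            switch (word.charAt(i)) {
                case ' ':
                case '\t':
                case '\n':
                case '\u000B':
                case '\f':
                case '\r':
                    return true;
                default:
                    break;
            }
        }
        return false;
    }

    // Regex-based variant, kept only to compare results against.
    static boolean containsWhitespaceRegex(String word) {
        return ASSUMED_PATTERN.matcher(word).find();
    }
}
```

The scan is linear in the length of the word, which is negligible for regular words, and each per-character test is constant time. Note that a Set of one-character strings would have to be queried per character, since Set.contains(word) only tests whether the whole word equals one of the entries.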

Regarding GC, yes, memory will be cleaned in the next cycle, but since we are 
working in a web environment that has concurrent requests and a limited amount 
of memory per container, I believe less memory allocation can be beneficial.

 


was (Author: JIRAUSER305510):
Sure, I mean, contains() is slower for big strings, but not for small ones. My 
suggestion is to use a set, in order to avoid memory allocation and resolve in 
O ( 1 ) time.
{code:java}
var SPACES_SET = Set.of(" ", "\t", "\n", "\r", "\f", "\u000B");{code}
Attached I've provided a simple benchmark: [^Main.java]

I can see a similar pattern on memory allocation for regexp here.

> StringUtil.PATTERN_SPACE memory optmisation
> -------------------------------------------
>
>                 Key: PDFBOX-5823
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5823
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Jonathan Prates
>            Priority: Minor
>         Attachments: Main.java, Screenshot 2024-05-19 at 22.39.10.png, 
> Screenshot 2024-05-19 at 22.40.17.png
>
>
> PDAbstractContentStream uses StringUtil.PATTERN_SPACE regexp to evaluate if a 
> word has a space in it 
> ([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624])
> For large documents (~800 pages) and small string sequences (like a regular 
> word), it causes memory overhead (see attached) due to the extra 
> allocations. I've replaced the regexp with word.contains checks for space 
> and \t, and since contains() is cheap for short words and does not require 
> extra allocations, memory usage has been reduced.
> What would be the implications of replacing this block with contains()?
> Since \s is [ \t\n\x0B\f\r], I believe a simplified version could cover the 
> same characters while allocating less memory.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
