[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847855#comment-17847855 ]
Jonathan Prates edited comment on PDFBOX-5823 at 5/20/24 12:44 PM:
-------------------------------------------------------------------

Sure, I mean, contains() is slower for big strings, but not for small ones. My suggestion is to use a set, in order to avoid memory allocation and resolve in O(1) time.
{code:java}
var SPACES_SET = Set.of(" ", "\t", "\n", "\r", "\f", "\u000B");{code}
I've attached a simple benchmark: [^Main.java]

I can see a similar memory allocation pattern for the regexp here.

Regarding GC, yes, memory will be cleaned in the next cycle, but since we are working in a web environment that has concurrent requests and a limited amount of memory per container, I believe less memory allocation can be beneficial.

> StringUtil.PATTERN_SPACE memory optimisation
> --------------------------------------------
>
>                 Key: PDFBOX-5823
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5823
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Jonathan Prates
>            Priority: Minor
>         Attachments: Main.java, Screenshot 2024-05-19 at 22.39.10.png,
> Screenshot 2024-05-19 at 22.40.17.png
>
>
> PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check
> whether a word contains a space
> ([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624]).
> For large documents (~800 pages) and short strings (such as a single
> word), this causes memory overhead (see attached) due to the many extra
> allocations.
> I've replaced the regexp with word.contains checks for space and \t,
> and since it's an O(1) operation that does not require extra allocations,
> memory usage has been reduced.
> What would be the implications of replacing this block with contains()?
> Since \s is [ \t\n\x0B\f\r], I believe we have a simplified version that
> allocates less memory.
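
For reference, the two approaches discussed above could be compared with a minimal sketch like the one below. It assumes StringUtil.PATTERN_SPACE is equivalent to \s (so the local SPACE pattern is only a stand-in), and the class and method names are hypothetical, not PDFBox API. The loop variant is linear in the word length rather than O(1), but it avoids the Matcher that the regexp variant allocates on every call.

```java
import java.util.regex.Pattern;

final class WhitespaceCheck {
    // Stand-in for StringUtil.PATTERN_SPACE; assumed to be equivalent to \s.
    private static final Pattern SPACE = Pattern.compile("\\s");

    /** Regexp variant: creates a new Matcher object on every call. */
    static boolean hasWhitespaceRegex(String word) {
        return SPACE.matcher(word).find();
    }

    /** Loop variant: scans the characters once and allocates nothing. */
    static boolean hasWhitespaceLoop(String word) {
        for (int i = 0; i < word.length(); i++) {
            switch (word.charAt(i)) {
                // Same six characters that the default \s class matches.
                case ' ':
                case '\t':
                case '\n':
                case '\u000B':
                case '\f':
                case '\r':
                    return true;
                default:
                    break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {"word", "two words", "tab\there", "", "vt\u000Bx"};
        for (String s : samples) {
            // Report whether both variants agree on each sample.
            System.out.println(hasWhitespaceRegex(s) == hasWhitespaceLoop(s));
        }
    }
}
```

Whether the difference matters in practice would still need to be measured, for example with the attached Main.java benchmark or an allocation profiler.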