I haven’t looked at this particular code at all, but I’m guessing that a LOT of 
the objects being referenced are strings — possibly identical strings?

It may be useful to implement object (string) pooling. That can save a ton 
of memory.

Do not use the built-in String.intern() method for this, though. That is 
limited and slow. It’s better to build the string pool around something like 
ConcurrentHashMap.putIfAbsent().

You then need to rewrite the code to do a pool check wherever new strings are 
created or read in.


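// Do this once and make it available to all your code, whether as a
// singleton or a static field.
ConcurrentHashMap<String, String> stringPool = new ConcurrentHashMap<>();

String s = someStepThatCreatesOrInputsAString();

// Add this step everywhere. putIfAbsent() returns null when the string
// was not in the pool yet, so keep s in that case.
String pooled = stringPool.putIfAbsent(s, s);
if (pooled != null) s = pooled;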


This imposes a small lookup cost on every putIfAbsent() call, but benchmarks 
you can find on the ’net show it’s still way faster than String.intern(), 
especially when the number of distinct strings gets large. The putIfAbsent() 
call is atomic, so this is thread safe.

The end result is that you keep only one copy of any given string in memory, 
regardless of how many references you might have to it.
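
In case it helps, here is a slightly fuller sketch of the pool described above. 
The class name StringPool and the methods canonical() and size() are made up 
for illustration; only ConcurrentHashMap.putIfAbsent() comes from the JDK. The 
one thing to watch is that putIfAbsent() returns null when the key was not 
already present, so the helper falls back to the string it was given.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration; not part of PDFBox or the JDK.
final class StringPool {
    // One shared map; each string is both key and value.
    private static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

    private StringPool() {}

    /** Returns the pooled instance equal to s, adding s to the pool if it is new. */
    static String canonical(String s) {
        if (s == null) {
            return null; // ConcurrentHashMap does not accept null keys
        }
        String existing = POOL.putIfAbsent(s, s);
        // null means s was just inserted, so s itself is the pooled copy
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently held by the pool. */
    static int size() {
        return POOL.size();
    }
}
```

Usage is just s = StringPool.canonical(s) wherever a string is created or read 
in. Note that the map holds strong references, so pooled strings are never 
garbage collected while the pool is alive; that is fine for a bounded set of 
values but worth keeping in mind for unbounded input.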

This can also be used for non-String objects, but it’s important that the 
class implements proper equals() and hashCode() methods.
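
As a rough sketch of the non-String case: the Label class and the generic 
ObjectPool below are hypothetical names invented for this example. The point 
is only that two separately created objects that are equal under equals() and 
hashCode() collapse to one pooled instance. Pooled objects should be immutable, 
since every holder ends up sharing the same instance.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical immutable value type used only for this example.
final class Label {
    private final String name;
    private final int size;

    Label(String name, int size) {
        this.name = name;
        this.size = size;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Label)) return false;
        Label other = (Label) o;
        return size == other.size && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        // Must agree with equals(): equal objects produce equal hash codes.
        return Objects.hash(name, size);
    }
}

// Hypothetical generic pool; works for any type with value-based equals()/hashCode().
final class ObjectPool<T> {
    private final ConcurrentHashMap<T, T> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance equal to value, adding value if it is new. */
    T canonical(T value) {
        T existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    int size() {
        return pool.size();
    }
}
```

Without the equals() and hashCode() overrides, the map would fall back to 
identity comparison and every object would be treated as unique, so nothing 
would be shared.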

I hope this suggestion is helpful.  This pattern saved me massive amounts of 
memory in clients pulling data from cloud databases. 

If it doesn’t make sense to apply in this particular code, then hopefully it 
will still prove a useful tip for someone somewhere else.

Cheers,

Mel

Dr. Mel Martinez
[email protected]



> On Jun 10, 2021, at 1:07 PM, Mark A. Claassen <[email protected]> wrote:
> 
> Thanks for the reply.
> 
>> Why should the list not be kept? We need it for when the file is saved.
> 
> I need to study that code a bit more, there is a lot going on there that I 
> don't yet understand.  What I was thinking was if there might be an 
> alternative to keeping the stream object in memory, like storing the 
> necessary metadata for it in a smaller structure.  
> 
> Maybe the stream is the perfect object for this.  However, at 4K or more a 
> piece, and one per page, this scales at least linearly with the number of 
> pages.  When dealing with "normal" documents, this is not an issue.  But when 
> the number of pages gets large, this overhead is significant. 
> 
> We had someone try to create a PDF from a 25,000 page text source.  25,000 * 
> 4K is 100 megabytes.  If it was possible to not maintain any data in the 
> ScratchFileBuffer, it would scale a bit better.
> 
> Thanks again,
> 
> Mark Claassen
> Senior Software Engineer
> 
> Donnell Systems, Inc.
> 130 South Main Street
> Leighton Plaza Suite 375
> South Bend, IN  46601
> E-mail: mailto:[email protected]
> Voice: (574)232-3784
> Fax: (574)232-4014
> 
> Disclaimer:
> The opinions provided herein do not necessarily state or reflect 
> those of Donnell Systems, Inc.(DSI). DSI makes no warranty for and 
> assumes no legal liability or responsibility for the posting. 
> 
> 
> -----Original Message-----
> From: Tilman Hausherr <[email protected]> 
> Sent: Thursday, June 10, 2021 12:02 PM
> To: [email protected]
> Subject: [Possible Spam] Re: PDF Memory issue
> Importance: Low
> 
> Why should the list not be kept? We need it for when the file is saved.
> 
> Tilman
> 
> On 10.06.2021 at 03:07, Mark A. Claassen wrote:
>> (This was started on the users list, but I am switching over to the 
>> dev list.)
>> 
>> I found the issue.  I have a bunch of small pages.  The COSDocument keeps a 
>> list of the streams that have been created.  The problem is that the 
>> currentPage in the ScratchFileBuffer is always in memory.  If there are 
>> 40,000 pages, then this will add up to 40,000 * the page size (4096) which 
>> is over 160,000,000.
>> 
>> So, now I am not sure how to deal with this.  Each page has a 
>> PDPageContentStream, which creates a ScratchFileBuffer.
>> This ScratchFileBuffer is kept in the list of streams.  I could recompile 
>> with a smaller page size, but that will only cut the problem by a 
>> percentage.  Does anyone think it may be possible to change this to not 
>> maintain the list of streams?  Or maybe clear the currentPage byte array for 
>> the items in the list?
>> 
>> I am willing to do some work on this, but a little guidance (or realism) 
>> would be helpful before I get too deep into this.
>> 
>> Thanks,
>> 
>> Mark Claassen
>> 
>> -----Original Message-----
>> From: Mark A. Claassen <[email protected]>
>> Sent: Wednesday, June 9, 2021 4:53 PM
>> To: [email protected]
>> Subject: [Possible Spam] RE: PDF Memory issue
>> Importance: Low
>> 
>> In looking at this further, it seems that the ScratchFileBuffer.close method 
>> is only called when the document is closed.  ScratchFileBuffer.clear is 
>> never called.
>> 
>> These are the only places where pageHandler.markPagesAsFree is called.  
>> I believe this is the issue: since markPagesAsFree is never called, this 
>> content just keeps building up until the document is closed.
>> 
>> Any guidance would be greatly appreciated.  I can't seem to find a 
>> configuration workaround for this issue.
>> 
>> Mark Claassen
>> 
>> 
>> -----Original Message-----
>> From: Mark A. Claassen <[email protected]>
>> Sent: Wednesday, June 9, 2021 1:39 PM
>> To: [email protected]
>> Subject: [Possible Spam] PDF Memory issue
>> Importance: Low
>> 
>> Hi.  Thanks for your time.
>> 
>> I am using PDFBox and am having trouble creating large PDFs (50,000+ 
>> pages).  The heap size of the process is capped, but with the temp file 
>> active (which I can see being created) I didn't think this would matter.
>> 
>> Here is what I am doing in a very condensed form:
>>      MEMORY_SETTING = MemoryUsageSetting.setupTempFileOnly();
>>      PDDocument doc = new PDDocument(MEMORY_SETTING);
>>      
>>      for (...) {
>>              String text = [generate page text]
>>              PDPage page = new PDPage(PDRectangle.LETTER);
>>              try (PDPageContentStream contentStream = new PDPageContentStream(
>>                              doc, page, PDPageContentStream.AppendMode.OVERWRITE, false)) {
>>                      
>>                      contentStream.endText();
>>                      doc.addPage(page);
>>              }
>>      }
>> 
>> When I do a heap dump, I see over 100 MB of memory taken by 42,000 
>> instances of ScratchFileBuffer.currentPage
>> 
>> Is there something I am doing wrong here?  Or is this a bug?  It seems like 
>> I must be doing something wrong / forgetting to do something, since this is 
>> a problem in 2 and 3-RC1.
>> 
>> Thanks again,
>> 
>> Mark Claassen
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
