Re: [Possible Spam] Re: PDF Memory issue

Martinez, Mel - 0441 - MITLL Thu, 10 Jun 2021 14:48:44 -0700

Ah.  Too bad.

Note that if the byte arrays are immutable (or at least treated as such) and 
held in a wrapper object (such as ByteBuffer) with, as I indicated, a proper 
.equals() and .hashCode() implementation, object pooling can still be effective.
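
A minimal sketch of that idea, assuming the pooled arrays are never modified 
once they are in the pool. ByteArrayPool and its intern() method are 
hypothetical names, not part of PDFBox or the JDK; only ByteBuffer and 
ConcurrentHashMap are standard classes. ByteBuffer already compares and hashes 
by content, so it can act as the wrapper:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: pools byte arrays by content.
final class ByteArrayPool {
    private final ConcurrentHashMap<ByteBuffer, ByteBuffer> pool = new ConcurrentHashMap<>();

    /** Returns the canonical read-only buffer whose content equals data. */
    ByteBuffer intern(byte[] data) {
        // Copy so later changes to the caller's array cannot alter a pooled key.
        ByteBuffer candidate = ByteBuffer.wrap(data.clone()).asReadOnlyBuffer();
        // putIfAbsent() returns null when the key was not present before.
        ByteBuffer existing = pool.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    /** Number of distinct byte sequences currently pooled. */
    int size() {
        return pool.size();
    }
}
```

ByteBuffer's equals() and hashCode() depend on the buffer's current position 
and limit, so readers should call duplicate() on the returned buffer before 
reading from it rather than moving the pooled instance's position.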


Good luck!

Mel

Dr. Mel Martinez
[email protected]



> On Jun 10, 2021, at 4:57 PM, Mark A. Claassen <[email protected]> wrote:
> 
> Thanks for the tips.  I don't think they will help here, however.  The 4K 
> object that is being held is a byte array.
> 
> Thanks again,
> 
> Mark Claassen
> Senior Software Engineer
> 
> Donnell Systems, Inc.
> 130 South Main Street
> Leighton Plaza Suite 375
> South Bend, IN  46601
> E-mail: mailto:[email protected]
> Voice: (574)232-3784
> Fax: (574)232-4014
> 
> 
> -----Original Message-----
> From: Martinez, Mel - 0441 - MITLL <[email protected]> 
> Sent: Thursday, June 10, 2021 4:28 PM
> To: [email protected]
> Cc: Martinez, Mel - 0441 - MITLL <[email protected]>
> Subject: Re: [Possible Spam] Re: PDF Memory issue
> Importance: Low
> 
> I haven’t looked at this particular code at all, but I’m guessing that a LOT 
> of the objects being referenced are strings — possibly identical strings?
> 
> It may be useful to implement object (string) pooling. That can save a ton 
> of memory.
> 
> Do not use the built-in String.intern() function for this, though. That is 
> limited and slow. It’s better to build the string pool around something 
> like ConcurrentHashMap.putIfAbsent().
> 
> You then need to rewrite the code to do a pool check whenever new strings are 
> created / input.
> 
> 
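> import java.util.concurrent.ConcurrentHashMap;
> 
> // Do this once and make it available to all your code, whether as a
> // singleton or a static.
> ConcurrentHashMap<String, String> stringPool = new ConcurrentHashMap<>();
> 
> String s = someStepThatCreatesOrInputsAString();
> 
> // Add this step everywhere. putIfAbsent() returns null when s was not
> // already pooled, so keep s in that case.
> String pooled = stringPool.putIfAbsent(s, s);
> if (pooled != null) s = pooled;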
> 
> 
> This imposes a small lookup cost with every putIfAbsent() call, but 
> benchmarks you can find on the ’net show it’s still way faster than 
> String.intern(), especially for large numbers of strings.  The putIfAbsent() 
> call is atomic, so this is perfectly thread safe.
> 
> The end result is that you can enforce that you will have only one copy of 
> any string in memory, regardless of how many references you might have of it.
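
The pattern above could be wrapped in a small helper so the pool check is a 
single call. This is only a sketch: StringPool and intern() are hypothetical 
names, and the one detail to watch is that putIfAbsent() returns null the first 
time a key is added, so the helper falls back to the argument in that case.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper wrapping the putIfAbsent() pooling pattern.
final class StringPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance equal to s, adding s if it is the first one seen. */
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        // null means s was not pooled before and is now the canonical copy.
        return existing != null ? existing : s;
    }
}
```

Callers would then write s = pool.intern(s) wherever a string is created or 
read in, and drop their reference to any duplicate.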
> 
> This can also be used for non-String objects, but it’s important that the 
> object have proper .equals() and .hashCode() methods implemented.
> 
> I hope this suggestion is helpful.  This pattern saved me massive amounts of 
> memory in clients pulling data from cloud databases. 
> 
> If it doesn’t make sense to apply in this particular code, then hopefully it 
> will still prove a useful tip for someone somewhere else.
> 
> Cheers,
> 
> Mel
> 
>> On Jun 10, 2021, at 1:07 PM, Mark A. Claassen <[email protected]> wrote:
>> 
>> Thanks for the reply.
>> 
>>> Why should the list not be kept? We need it for when the file is saved.
>> 
>> I need to study that code a bit more; there is a lot going on there that I 
>> don't yet understand.  I was wondering whether there might be an 
>> alternative to keeping the stream object in memory, such as storing the 
>> necessary metadata for it in a smaller structure.  
>> 
>> Maybe the stream is the perfect object for this.  However, at 4K or more a 
>> piece, and one per page, this scales at least linearly with the number of 
>> pages.  When dealing with "normal" documents, this is not an issue.  But 
>> when the number of pages gets large, this overhead is significant. 
>> 
>> We had someone try to create a PDF from a 25,000-page text source.  25,000 * 
>> 4K is about 100 megabytes.  If it were possible to not maintain any data in 
>> the ScratchFileBuffer, it would scale a bit better.
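
The arithmetic can be checked with a few lines of Java. This is a 
back-of-the-envelope sketch, assuming exactly one 4096-byte buffer page stays 
resident per PDF page; ScratchPageEstimate and residentBytes() are hypothetical 
names and do not exist in PDFBox.

```java
// Hypothetical estimate of heap held by resident scratch-buffer pages.
final class ScratchPageEstimate {
    // Buffer page size quoted in this thread, in bytes.
    static final int BUFFER_PAGE_SIZE = 4096;

    /** Bytes held if each PDF page keeps one buffer page in memory. */
    static long residentBytes(int pdfPages) {
        // Widen to long before multiplying so large page counts cannot overflow.
        return (long) pdfPages * BUFFER_PAGE_SIZE;
    }
}
```

residentBytes(25_000) gives 102,400,000 bytes, which is the roughly 100 
megabytes cited above, and the figure grows linearly with the page count.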
>> 
>> Thanks again,
>> 
>> Mark Claassen
>> 
>> -----Original Message-----
>> From: Tilman Hausherr <[email protected]> 
>> Sent: Thursday, June 10, 2021 12:02 PM
>> To: [email protected]
>> Subject: [Possible Spam] Re: PDF Memory issue
>> Importance: Low
>> 
>> Why should the list not be kept? We need it for when the file is saved.
>> 
>> Tilman
>> 
>> Am 10.06.2021 um 03:07 schrieb Mark A. Claassen:
>>> (This was started on the users list, but I am switching over to the 
>>> dev list.)
>>> 
>>> I found the issue.  I have a bunch of small pages.  The COSDocument keeps a 
>>> list of the streams that have been created.  The problem is that the 
>>> currentPage in each ScratchFileBuffer is always in memory.  If there are 
>>> 40,000 pages, then this will add up to 40,000 * the page size (4096 bytes), 
>>> which is over 160,000,000 bytes.
>>> 
>>> So, now I am not sure how to deal with this.  Each page has a 
>>> PDPageContentStream, which creates a ScratchFileBuffer.
>>> This ScratchFileBuffer is kept in the list of streams.  I could recompile 
>>> with a smaller page size, but that will only cut the problem by a 
>>> percentage.  Does anyone think it may be possible to change this to not 
>>> maintain the list of streams?  Or maybe clear the currentPage byte array 
>>> for the items in the list?
>>> 
>>> I am willing to do some work on this, but a little guidance (or realism) 
>>> would be helpful before I get too deep into this.
>>> 
>>> Thanks,
>>> 
>>> Mark Claassen
>>> 
>>> -----Original Message-----
>>> From: Mark A. Claassen <[email protected]>
>>> Sent: Wednesday, June 9, 2021 4:53 PM
>>> To: [email protected]
>>> Subject: [Possible Spam] RE: PDF Memory issue
>>> Importance: Low
>>> 
>>> In looking at this further, it seems that the ScratchFileBuffer.close 
>>> method is only called when the document is closed.  ScratchFileBuffer.clear 
>>> is never called.
>>> 
>>> These are the only places where pageHandler.markPagesAsFree is called.  
>>> I believe this is the issue: since markPagesAsFree is not called until the 
>>> document is closed, this content just keeps building up.
>>> 
>>> Any guidance would be greatly appreciated.  I can't seem to find a 
>>> configuration workaround for this issue.
>>> 
>>> Mark Claassen
>>> 
>>> 
>>> -----Original Message-----
>>> From: Mark A. Claassen <[email protected]>
>>> Sent: Wednesday, June 9, 2021 1:39 PM
>>> To: [email protected]
>>> Subject: [Possible Spam] PDF Memory issue
>>> Importance: Low
>>> 
>>> Hi.  Thanks for your time.
>>> 
>>> I am using PDFBox and am having trouble creating large PDFs (50,000+ 
>>> pages).  The heap size of the process is capped, but with the temp file 
>>> active (which I can see being created) I didn't think this would matter.
>>> 
>>> Here is what I am doing in a very condensed form:
>>>     MemoryUsageSetting MEMORY_SETTING = MemoryUsageSetting.setupTempFileOnly();
>>>     PDDocument doc = new PDDocument(MEMORY_SETTING);
>>>     
>>>     for (...) {
>>>         String text = [generate page text]
>>>         PDPage page = new PDPage(PDRectangle.LETTER);
>>>         // Closing the content stream is handled by try-with-resources.
>>>         try (PDPageContentStream contentStream = new PDPageContentStream(
>>>                 doc, page, PDPageContentStream.AppendMode.OVERWRITE, false)) {
>>>             // ... (text drawing calls omitted) ...
>>>             contentStream.endText();
>>>             doc.addPage(page);
>>>         }
>>>     }
>>> 
>>> When I do a heap dump, I see over 100 MB of memory taken by 42,000 
>>> instances of ScratchFileBuffer.currentPage.
>>> 
>>> Is there something I am doing wrong here?  Or is this a bug?  It seems like 
>>> I must be doing something wrong / forgetting to do something, since this is 
>>> a problem in both 2 and 3-RC1.
>>> 
>>> Thanks again,
>>> 
>>> Mark Claassen
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
