[ 
https://issues.apache.org/jira/browse/PDFBOX-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346116#comment-17346116
 ] 

robo commented on PDFBOX-5068:
------------------------------

[^issue5068.patch]

Dear PDFBox-Community,

I implemented a solution for the present memory-problem, which you can find in 
the attached git-patch. Here is a short description of the problem and how I 
proceeded (more details below):

1. Currently, when incrementally saving a PDF, PDFBox loads all COS Objects 
into memory (see COSWriter::prepareIncrement). This has the result that the 
memory usage increases in most cases linearly with the size of the to be saved 
PDF, although only small portion of the COS Objects is actually needed. 
Furthermore, this kind of “eager” loading of the COS Object has in some cases 
the undesirable side effect that the signing may take an unexpectedly long time 
(I listed an example of an 20MB PDF, which takes more than 12s).

2. The COSWriter::prepareIncrement method, again, makes use of two HashMaps two 
save the keys and COSObjects themselves in a bidirectional fashion. This 
bidirectional usage of the maps happens more than once within the COSWriter 
class. Thus, I first extracted this duplicated code into an 
EagerCOSWriterObjectStorage class (sub-typing an abstract 
COSWriterObjectStorage).

3. Finally, I implemented a LazyCOSWriterObjectStorage that retrieves the COS 
Objects only when they are actually required and tested it with different PDF 
files (using CreateVisibleSignature2.java). The results are roughly as follows:

=======================

1.3GB PDF (text)
*{color:#de350b}eager 16.4s (-Xmx1600M){color}* [requires >= -Xmx1600M on my PC]
lazy 8.3s (-Xmx1600M) 9.4s (-Xmx85M) [requires >= -Xmx85M]

200m PDF (scanned book, no OCR)
eager 7.6 (-Xmx230M) [requires >= -Xmx230M]
lazy 7.0 (-Xmx230M) 7.6 (-Xmx25M)

[https://www.bergundtal.ch/Data/Inhalte/PDF_BT_Jahreskatalog_2018_low.pdf]
50m PDF (text)
eager 1.8 (-Xmx85M) [requires -Xmx >= 85M]
lazy 1.4 (-Xmx85M) 1.6 (-Xmx25M)

[https://crossasia-books.ub.uni-heidelberg.de/xasia/reader/download/506/506-42-86246-2-10-20190822.pdf]
21M PDF (text)
*{color:#de350b}eager 13s (-Xmx130M){color}* [requires >= -Xmx130M]
lazy 1.4s (-Xmx130M) 1.6s (-Xmx70M) [requires >= -Xmx70M]

[minimum.pdf, attached to this JIRA Case]
31kb PDF (text)
eager 0.9 (-Xmx25M) 
lazy 0.9 (-Xmx25M)

=======================

More technical details and points of discussion:

• I was only aiming to provide a solution without refactoring to much of the 
current code. EagerCOSWriterObjectStorage should 1:1 replicate the behavior as 
it was before. The problem is that the COS-related code seems to be hardly 
maintainable and small changes may have unexpected results. One problem, for 
example, was that once a COSObject is de-referenced (COSObject::getObject), it 
is cached inside the COSObject instance, presumably for reasons of optimization 
but also, because COSObjects (de-referenced or not) are compared by reference 
(instance equality) and not semantically. This lack of equals/hashcode 
implementation inside COSObject was especially problematic, since COSObjects 
are used as keys in several HashMaps/HashSets. First I assumed, that a 
COSObjectKey can be used to identify a COSObject (there is also a 
COSObject::getKey). However, this was not case, or at least in the scope of 
COSWriter and the signing process. Furthermore, the key of the COSObject was 
not always set, which is yet another problem. Since I am still not sure how to 
compare to COSObjects semantically (that is in the way it is currently 
implemented and used), I eventually solved this problem by implementing a 
COSObject::getObjectWithoutCaching and by storing a reference to the COSObject 
in de-referenced object, rather than the other way around. A further problems 
is, that the state of many COS related classes is only defined at run-time and 
not at instance creation. This has also the result that the code is hard to 
test (partially, it seems, only integration tests are possible). I know that 
the code base is very old. What I write here should also by no means by 
understood as a critique. In contrary, my intention is only to document the 
problems I encountered and how it was rather complicated (at least for me as an 
outsider not familiar with the code) to make a change, which I expected should 
be much simpler.
 
 • Please consider the attached patch as a work in progress. I did not include 
the partial tests, on which the COSWriterObjectStorage is based on, and also 
did not add any documentation yet. I will add both the documentation and tests 
after the final clean up. Since everything appears to be working fine with the 
LazyCOSWriterObjectStorage only, it would be great if the EagerCOSObjectStorage 
could be dumped altogether. This would also have the advantage that there is no 
need to inject the type of COSWriterObjectStorage from the outside and further 
add complexity to the interface (which is why I have not implemented this part 
fully yet). However, I am not too sure whether the current tests actually cover 
most of the cases. There are also many things that could be improved, but would 
probably also induce more far-reaching changes. For example, it would be great 
if the COSDocument could be set at construction time. Furthermore, 
COSObject::getObject is still called from some other places in the code. Thus, 
some caching still takes place. Finally, there is also a COSObjectPool that 
already encapsulates a bidirectional key-object map. However, it did not really 
fit what I intended to do. Still, both could be possibly reconciled.
 
 • Anyhow, I think that resolving this issue would be a substantial 
improvement. Generally, I am skeptical of optimizations, but the present 
memory-problem touches not only edge cases. Actually, what I did here was not 
an optimization, but removing an apparently unnecessary optimization (caching), 
which caused the performance and memory problems in the first place.

I am looking forward to your suggestions and comments.

> OutOfMemory while signing large documents - continued
> -----------------------------------------------------
>
>                 Key: PDFBOX-5068
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5068
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Signing
>    Affects Versions: 2.0.23
>            Reporter: Ralf Hauser
>            Priority: Major
>         Attachments: RandomAccessReadBufferDiag.java, issue5068.patch, 
> minimum.pdf
>
>
> Continuation of PDFBOX-2512
>  
> in COSWriter.prepareIncrement(), for the test case 
> cosDoc.getXrefTable().keySet() has size 5925. For each of thes keys, 
> cosDoc.getObjectFromPool() gets an object that is not just referencing some 
> part of the input document, but duplicates it (which is unavoidable in the 
> case where they are decompressed with FlateFilter - albeit this could 
> possibly be done "lazy")
> -Xmx20m  746/5925
>  -Xmx25m 1615/5925
>  -Xmx30m 2800/5925
>  -Xmx40m 3872/5925
>  -Xmx55m 5773/5925
> With 60m, it gets them all, but dies later with less telling
>    java.lang.OutOfMemoryError: GC overhead limit exceeded
>  
> This assumes the patch of PDFBOX-5067 already in place - or using 
> CreateVisibleSignature2.java as starting point



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to