[ https://issues.apache.org/jira/browse/PDFBOX-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17925503#comment-17925503 ]
Tilman Hausherr edited comment on PDFBOX-5950 at 2/10/25 8:06 AM:
------------------------------------------------------------------

So I tried your change with the trunk (and likely with 3.0) by running

java -jar pdfbox-app-4.0.0-SNAPSHOT.jar merge ComSquare1.pdf Ghostscript1.pdf res.pdf

then start PDFDebugger, choose "view", "show internal structure" and then look for {{Info/ImPDF/Images/Kids/[0]}}, there's an image and it's missing.

With 2.0 I get an exception by merging from the command line like this:

java -jar pdfbox-app-2.0.34-SNAPSHOT.jar PDFMerger ComSquare1.pdf Ghostscript1.pdf res20.pdf

{noformat}
Exception in thread "main" java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
	at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:83)
	at org.apache.pdfbox.cos.COSStream.createRawInputStream(COSStream.java:133)
	at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1290)
	at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:416)
	at org.apache.pdfbox.cos.COSObject.accept(COSObject.java:195)
	at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:570)
	at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:496)
	at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:480)
	at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1184)
	at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:455)
	at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1457)
	at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1344)
	at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1381)
	at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1353)
	at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1337)
	at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:488)
	at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:349)
	at org.apache.pdfbox.tools.PDFMerger.merge(PDFMerger.java:70)
	at org.apache.pdfbox.tools.PDFMerger.main(PDFMerger.java:49)
	at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:85)
{noformat}

The problem is related to this code:

{code:java}
PDDocumentInformation destInfo = destination.getDocumentInformation();
PDDocumentInformation srcInfo = source.getDocumentInformation();
mergeInto(srcInfo.getCOSObject(), destInfo.getCOSObject(), Collections.emptySet());
{code}

{{mergeInto}} just calls {{setItem}} because no streams are expected here. However the comsquare file does have a stream. We could try to clone instead, but I wonder what other surprises may occur.


> pdfbox PDFMergerUtility Potential OOM issue
> --------------------------------------------
>
>                 Key: PDFBOX-5950
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5950
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.32, 3.0.0 PDFBox
>         Environment: jdk11
>            Reporter: asdpboy
>            Priority: Major
>         Attachments: after.png, before.png, oom.png
>
>
> I have identified a potential bug in Apache PDFBox and would like to report it.
> Below are the details:
>
> When there are a large number of sources (e.g., thousands), the `tobeclosed`
> method will load the PDF document into memory. This may pose a risk of
> Out-of-Memory (OOM) during the merge process.
>
> The following adjustment can be made: close the sourceDoc object immediately.
>
> org.apache.pdfbox.multipdf.PDFMergerUtility#legacyMergeDocuments
> {code:java}
> for (Object sourceObject : sources)
> {
>     PDDocument sourceDoc = null;
>     if (sourceObject instanceof File)
>     {
>         sourceDoc = PDDocument.load((File) sourceObject, partitionedMemSetting);
>     }
>     else
>     {
>         sourceDoc = PDDocument.load((InputStream) sourceObject, partitionedMemSetting);
>     }
>     try
>     {
>         appendDocument(destination, sourceDoc);
>     }
>     finally
>     {
>         IOUtils.closeAndLogException(sourceDoc, LOG, "PDDocument", null);
>     }
> }
> {code}
> One of the OOM cases:
> !oom.png!
> Comparison of memory usage before and after the modification (merging a
> 16.8MB file 200 times, with the JVM heap size limit set to 2GB)
> Before modification: An OutOfMemoryError (OOM) occurred after just over 1
> minute of operation. Due to insufficient heap memory, Full GC (Full Garbage
> Collection) was triggered frequently, which can be observed from the CPU
> usage curve on the left.
> !before.png!
> After modification: The heap memory can now be collected normally without
> causing an OOM.
> !after.png!
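
To illustrate how the proposal in the issue (closing each source right after appending) interacts with the by-reference behaviour of {{setItem}} described in the comment, here is a minimal, self-contained sketch. It does not use PDFBox at all: ToyStream, ToyDocument, mergeByReference and mergeByCopy are hypothetical names made up for this illustration, and the toy throws IllegalStateException where PDFBox throws IOException. It only models the general idea that an object shared by reference stops being readable once its owner is closed, while a copy does not; it is not the actual PDFMergerUtility logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-in for a stream object; hypothetical, not PDFBox's COSStream. */
final class ToyStream {
    private final byte[] data;
    private boolean closed;

    ToyStream(byte[] data) {
        this.data = data.clone();
    }

    /** Returns a copy of the bytes, or fails if the owning document was closed. */
    byte[] read() {
        if (closed) {
            throw new IllegalStateException("stream has been closed and cannot be read");
        }
        return data.clone();
    }

    /** Independent copy that stays readable after the original is closed. */
    ToyStream deepCopy() {
        return new ToyStream(read());
    }

    void close() {
        closed = true;
    }
}

/** Toy stand-in for a document whose info dictionary may hold streams. */
final class ToyDocument implements AutoCloseable {
    final Map<String, ToyStream> info = new LinkedHashMap<>();

    /** Reads every stream, as saving would; returns the total byte count. */
    int save() {
        int total = 0;
        for (ToyStream s : info.values()) {
            total += s.read().length;
        }
        return total;
    }

    @Override
    public void close() {
        info.values().forEach(ToyStream::close);
    }
}

final class MergeSketch {
    /** Shares the source's objects with the destination (no copy). */
    static void mergeByReference(ToyDocument src, ToyDocument dest) {
        dest.info.putAll(src.info);
    }

    /** Copies each stream so the destination no longer depends on the source. */
    static void mergeByCopy(ToyDocument src, ToyDocument dest) {
        src.info.forEach((key, stream) -> dest.info.put(key, stream.deepCopy()));
    }

    /** Builds a source document with one 3-byte stream in its info dictionary. */
    static ToyDocument sourceWithStream() {
        ToyDocument doc = new ToyDocument();
        doc.info.put("Image", new ToyStream(new byte[] {1, 2, 3}));
        return doc;
    }

    public static void main(String[] args) {
        ToyDocument copied = new ToyDocument();
        try (ToyDocument src = sourceWithStream()) {
            mergeByCopy(src, copied);
        } // source is closed right after the merge, as the issue proposes
        System.out.println("copy merge saved " + copied.save() + " bytes");

        ToyDocument shared = new ToyDocument();
        try (ToyDocument src = sourceWithStream()) {
            mergeByReference(src, shared);
        }
        try {
            shared.save();
        } catch (IllegalStateException e) {
            System.out.println("reference merge failed: " + e.getMessage());
        }
    }
}
```

In this toy model, closing the source early is only safe when everything the destination keeps has been copied first. Whether cloning is enough in the real merger is exactly the open question raised in the comment.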