Hi again. I've now pulled the latest commit from the 3.0 branch:
-----------------------------------------------------------------
commit 03218f9ff0439e7e3c9f10c5649a09b32e1817cf (HEAD -> 3.0, upstream/3.0)
Author: Andreas Lehmkühler <[email protected]>
Date:   Wed Feb 4 18:08:45 2026 +0000

    revert 3nd release attempt due to a possible regression

    git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1931714 13f79535-47bb-0310-9956-ffa450edef68
-----------------------------------------------------------------

Then I moved the execution of the reset function:

-----------------------------------------------------------------
public void addPage(PDPage page)
{
    // reset imported object keys to avoid overlapping object numbers
    page.getCOSObject().resetImportedObjectKeys();
    getPages().add(page);
}
-----------------------------------------------------------------

Our tooling has two main focuses: extracting text and optimizing PDFs (quicker, safer rendering and smaller files). The function that stopped working for us was the optimization, because we rewrite the PDF multiple times in order to find things that could be problematic to render on some systems and to shrink images that are larger than their rendered size.

Lastly, I ran all our tests. We have a small batch of 150+ files that I run for every change, just for sanity. Then I have three batches I run before a release:

1. Check with problematic PDFs (270 files). I run this and render images with Poppler, PDF.js, Chromium, and SIPS (iOS), so we can verify that the PDFs will render correctly on Android, iOS, and the web.
2. Size check: we run about 20 days' worth of newspapers that are usually sent to print to see how much we can reduce them in size. Usually, we can reduce them by about 88%.
3. Large test (50k pages): optimize and render with PDFBox to verify that we don't introduce any artifacts.

The results of our run were:

1. After visual review, no new issues were found; it looked like a normal run.
2. Files became slightly larger, but within the margin of error; no visual changes.
3. Slight visual changes, but after visual review I found no noticeable difference between the rendered images.

So, with the resetImportedObjectKeys call moved into addPage, we now have a version without any regression. The question that remains is whether we want to make this breaking change. I can review our code and replace a couple of addPage calls with importPage, but others might run into the same issue.

Best regards
Daniel

On Wed, Feb 4, 2026 at 8:59 PM Daniel Persson <[email protected]> wrote:

> Hi.
>
> I think I've found the change that created a regression for me. My code
> uses only addPage; nowhere do I import from another document. Usually, I
> create a new PDPage object, add a content stream, and clone resources when
> I reuse data from another document. One difference in the code is where
> we reset the object keys. So, a fix that would solve my problem and still
> fix PDFBOX-5752 would be to move the reset logic:
>
> -------------------------------------------
> public void addPage(PDPage page)
> {
>     page.getCOSObject().resetImportedObjectKeys();
>     getPages().add(page);
> }
> -------------------------------------------
>
> This works for me, instead of having the reset before the addPage call in
> the importPage function:
>
> -------------------------------------------
> public PDPage importPage(PDPage page) throws IOException
> {
>     PDPage importedPage = new PDPage(new COSDictionary(page.getCOSObject()), resourceCache);
>     importedPage.getCOSObject().removeItem(COSName.PARENT);
>     PDStream dest = new PDStream(this, page.getContents(), COSName.FLATE_DECODE);
>     importedPage.setContents(dest);
>     // reset imported object keys to avoid overlapping object numbers
>     importedPage.getCOSObject().resetImportedObjectKeys();
>     addPage(importedPage);
>     importedPage.setCropBox(new PDRectangle(page.getCropBox().getCOSArray()));
>     importedPage.setMediaBox(new PDRectangle(page.getMediaBox().getCOSArray()));
>     importedPage.setRotation(page.getRotation());
>     if (page.getResources() != null && !page.getCOSObject().containsKey(COSName.RESOURCES))
>     {
>         LOG.warn("inherited resources of source document are not imported to destination page");
>         LOG.warn("call importedPage.setResources(page.getResources()) to do this");
>     }
>     return importedPage;
> }
> -------------------------------------------
>
> Maybe this change was intentional.
> But it will at least break code like:
>
> ------------------------------------------------
> PDPage newPage = new PDPage();
> COSBase base = cloneUtility.cloneForNewDocument(page.getResources());
> newPage.setResources(new PDResources((COSDictionary) base));
> newPage.setMediaBox(page.getMediaBox());
> newPage.setCropBox(page.getCropBox());
> newPage.setTrimBox(page.getTrimBox());
> newPage.setRotation(page.getRotation());
>
> List<PDAnnotation> list = new ArrayList<>();
> for (PDAnnotation annotation : page.getAnnotations()) {
>     COSBase cloned = cloneUtility.cloneForNewDocument(annotation);
>     list.add(PDAnnotation.createAnnotation(cloned));
> }
> if (!list.isEmpty()) {
>     newPage.setAnnotations(list);
> }
>
> List<PDStream> newStream = new ArrayList<>();
> Iterator<PDStream> it = page.getContentStreams();
> while (it.hasNext()) {
>     newStream.add(it.next());
> }
> newPage.setContents(newStream);
> newDoc.addPage(newPage);
> ------------------------------------------------
>
> Best regards
> Daniel
>
> On Wed, Feb 4, 2026 at 7:19 PM Andreas Lehmkühler <[email protected]> wrote:
>
>> Are you using the import page feature? The mentioned commit fixes an
>> issue when importing pages containing objects with overlapping object
>> numbers. Other scenarios are most likely not affected.
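To make the overlapping-object-number scenario concrete, here is a toy, stdlib-only model of the failure mode. This is not PDFBox API; all class and method names are hypothetical and only mimic the idea that an imported object carries a key assigned by its source document, which can collide with a key already used by the destination unless it is cleared first:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: a native page and an imported page both carry object
// number 1. Without resetting the imported key, the second put()
// silently overwrites the first entry in the destination's object table.
public class KeyResetSketch {

    static Map<Integer, String> merge(boolean resetImportedKeys) {
        Map<Integer, String> xref = new HashMap<>();
        int nextFree = 2;

        xref.put(1, "native page");   // destination already uses key 1

        Integer importedKey = 1;      // source document also used key 1
        if (resetImportedKeys) {
            importedKey = null;       // analogue of resetImportedObjectKeys()
        }
        if (importedKey == null) {
            importedKey = nextFree++; // writer assigns a fresh number
        }
        xref.put(importedKey, "imported page");
        return xref;
    }

    public static void main(String[] args) {
        System.out.println(merge(false).size()); // collision: prints 1
        System.out.println(merge(true).size());  // reset avoids it: prints 2
    }
}
```

The design question in the thread is only *where* this clearing happens: in importPage (so plain addPage keeps hand-built keys intact) or in addPage (so every page added to a document gets renumbered).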
>>
>>
>> Am 04.02.26 um 13:19 schrieb Daniel Persson:
>> > Hi Andreas
>> >
>> > You are right, the commit that introduced the error is:
>> > ----------------------------------------------------
>> > commit 41c3a431e21c31a9cf6d6dec4b47a126bac2996f (HEAD)
>> > Author: Andreas Lehmkühler <[email protected]>
>> > Date:   Tue Dec 16 07:20:09 2025 +0000
>> >
>> >     PDFBOX-6036: avoid overlapping object keys when importing pages from another pdf
>> >
>> >     git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1930616 13f79535-47bb-0310-9956-ffa450edef68
>> > ----------------------------------------------------
>> >
>> > I still don't like the new implementation of COSWriterObjectStream. The
>> > original thread-safe implementation is simpler to read and more correct,
>> > but I understand if you want the change for performance reasons. Removing
>> > the synchronization seems like the wrong way to do this. And looking at the
>> > numbers, the original implementation handles memory and time better in my
>> > comparisons.
>> >
>> > == Test 3.0.6 ==
>> > iter=0 wall_ms=376.855 cpu_ms=322.945 alloc_mb=92.806 heap_before_mb=216.448 heap_after_mb=141.633 heap_delta_mb=-74.814 gc_count_delta=1 gc_time_ms_delta=5
>> > iter=1 wall_ms=308.536 cpu_ms=302.087 alloc_mb=92.774 heap_before_mb=141.633 heap_after_mb=233.633 heap_delta_mb=92.000 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=2 wall_ms=327.187 cpu_ms=316.977 alloc_mb=92.774 heap_before_mb=233.633 heap_after_mb=327.635 heap_delta_mb=94.002 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=3 wall_ms=396.564 cpu_ms=381.406 alloc_mb=92.774 heap_before_mb=327.635 heap_after_mb=113.293 heap_delta_mb=-214.343 gc_count_delta=1 gc_time_ms_delta=5
>> > iter=4 wall_ms=470.828 cpu_ms=469.447 alloc_mb=92.774 heap_before_mb=113.293 heap_after_mb=205.293 heap_delta_mb=92.000 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=5 wall_ms=528.677 cpu_ms=523.818 alloc_mb=94.127 heap_before_mb=205.293 heap_after_mb=301.293 heap_delta_mb=96.000 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=6 wall_ms=543.075 cpu_ms=533.379 alloc_mb=92.886 heap_before_mb=301.293 heap_after_mb=393.293 heap_delta_mb=92.000 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=7 wall_ms=489.066 cpu_ms=483.486 alloc_mb=92.886 heap_before_mb=393.293 heap_after_mb=485.293 heap_delta_mb=92.000 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=8 wall_ms=817.173 cpu_ms=512.633 alloc_mb=92.824 heap_before_mb=487.291 heap_after_mb=148.949 heap_delta_mb=-338.342 gc_count_delta=1 gc_time_ms_delta=4
>> > iter=9 wall_ms=897.804 cpu_ms=343.736 alloc_mb=92.824 heap_before_mb=148.949 heap_after_mb=240.949 heap_delta_mb=92.000 gc_count_delta=0 gc_time_ms_delta=0
>> >
>> > == Test 3.0.7 ==
>> > iter=0 wall_ms=507.491 cpu_ms=501.767 alloc_mb=94.896 heap_before_mb=195.869 heap_after_mb=170.142 heap_delta_mb=-25.726 gc_count_delta=1 gc_time_ms_delta=6
>> > iter=1 wall_ms=495.749 cpu_ms=492.284 alloc_mb=94.861 heap_before_mb=170.142 heap_after_mb=266.142 heap_delta_mb=96.000 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=2 wall_ms=437.024 cpu_ms=435.485 alloc_mb=94.861 heap_before_mb=266.142 heap_after_mb=360.140 heap_delta_mb=93.998 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=3 wall_ms=478.096 cpu_ms=465.265 alloc_mb=94.861 heap_before_mb=360.140 heap_after_mb=127.227 heap_delta_mb=-232.913 gc_count_delta=1 gc_time_ms_delta=5
>> > iter=4 wall_ms=1096.645 cpu_ms=509.049 alloc_mb=94.862 heap_before_mb=127.227 heap_after_mb=221.229 heap_delta_mb=94.002 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=5 wall_ms=1049.944 cpu_ms=319.307 alloc_mb=94.863 heap_before_mb=221.229 heap_after_mb=317.229 heap_delta_mb=96.000 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=6 wall_ms=1851.772 cpu_ms=343.559 alloc_mb=94.863 heap_before_mb=317.229 heap_after_mb=411.227 heap_delta_mb=93.998 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=7 wall_ms=485.274 cpu_ms=345.465 alloc_mb=94.863 heap_before_mb=411.227 heap_after_mb=116.257 heap_delta_mb=-294.970 gc_count_delta=1 gc_time_ms_delta=4
>> > iter=8 wall_ms=407.939 cpu_ms=405.857 alloc_mb=94.802 heap_before_mb=116.257 heap_after_mb=210.259 heap_delta_mb=94.002 gc_count_delta=0 gc_time_ms_delta=0
>> > iter=9 wall_ms=689.281 cpu_ms=380.940 alloc_mb=94.801 heap_before_mb=210.259 heap_after_mb=304.257 heap_delta_mb=93.998 gc_count_delta=0 gc_time_ms_delta=0
>> >
>> > But maybe you've seen another trend using another profiling tool.
>> >
>> > The results above are created by a ChatGPT-generated testing tool that warms
>> > the code 5 times and then tests it 10 times while outputting the results.
>> > The PDFBox code I ran was loading a PDF and saving the document without any
>> > changes.
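For readers who don't want to open the linked repository, the warm-up/measure protocol described above (5 untimed warm-up runs, then 10 timed iterations) can be sketched with the JDK alone. This is a rough, stdlib-only illustration, not Daniel's actual harness; the class name, method names, and the stand-in workload are all hypothetical:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Sketch of the measurement protocol: 5 untimed warm-up runs so the JIT
// can compile the hot paths, then 10 timed iterations recording wall-clock
// and per-thread CPU time in milliseconds.
public class BenchSketch {

    static List<double[]> benchmark(Runnable workload) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 5; i++) {
            workload.run(); // warm-up, results discarded
        }
        List<double[]> results = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            long wall0 = System.nanoTime();
            long cpu0 = threads.getCurrentThreadCpuTime();
            workload.run();
            double cpuMs = (threads.getCurrentThreadCpuTime() - cpu0) / 1e6;
            double wallMs = (System.nanoTime() - wall0) / 1e6;
            results.add(new double[] { wallMs, cpuMs });
            System.out.printf("iter=%d wall_ms=%.3f cpu_ms=%.3f%n", i, wallMs, cpuMs);
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in workload; the real test loaded and re-saved a PDF with PDFBox.
        benchmark(() -> {
            long sum = 0;
            for (int j = 0; j < 1_000_000; j++) sum += j;
        });
    }
}
```

The allocation and heap columns in the numbers above would additionally need a GC/memory MXBean snapshot per iteration; they are omitted here to keep the sketch short.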
>> >
>> > Full code here:
>> > https://github.com/kalaspuffar/PDFBoxTestBase/blob/main/src/main/java/TestingPerformance.java
>> >
>> > Best regards
>> > Daniel
>> >
>> > On Wed, Feb 4, 2026 at 8:28 AM Andreas Lehmkühler <[email protected]> wrote:
>> >
>> >> Hmmm, the first commit introduced a regression which ended up in crashes,
>> >> and the second one fixed that regression. The whole change was about
>> >> compressed object streams, which shall not contain already compressed
>> >> objects such as content streams using FlateFilter as filter. Saying
>> >> that, I'm hesitant to believe that your issue is related to those
>> >> changes. Maybe another commit between those commits is the root cause.
>> >>
>> >> Without some sample code it is fishing in troubled waters.
>> >>
>> >>
>> >> Am 03.02.26 um 18:17 schrieb Daniel Persson:
>> >>> Hi Andreas
>> >>>
>> >>> It's in 3.0.7. I ran a bunch of commits in order to figure out when the
>> >>> issue was introduced.
>> >>>
>> >>> 87011ade3 fail
>> >>> f3bb496975ee6ca6ae98c00c0e50cfc4375a3f8a fail
>> >>> 7ee6d390278fd0b06668ec65ede14810c6075ec9 crash
>> >>> 26283807ad crash
>> >>> dd76acd546 crash
>> >>> 2fef081c714d8c6524aab118e2bfec7cf379e45a crash
>> >>> 08bc6fdd5200966309787a8188c3d7d5827b170a crash
>> >>> 3800af7bc5d8f08af99a653b37f8e4cd67bf1659 crash
>> >>> 1d4ae695a83c33999bda78a1d9f8c43512940965 crash
>> >>> 1ac4a24f8f7dfd08924ef9645246656ad3b9b33a crash
>> >>> 994b87e2b4d30ac2435cff9fe20ecdfc6ab1b916 crash
>> >>> f82d2224a047bc642f1d38ff18360c61eaf9cccf success
>> >>> d7d34f25cec7f4884e8f599ed620b2c3c704017b success
>> >>> 045d17604640a68b798027300f690f0af2b1a95d success
>> >>> cdffe505e8bdeb5810456c1e6d9df61c7e2aab85 success
>> >>> 304ab0027d18fc8df5638f39bac033a55769dc4e success
>> >>> 222fb5f3b32fdb20f11107919700a80d1dcc130e success
>> >>>
>> >>> Newer commits on top.
>> >>>
>> >>> So the two pivotal commits are:
>> >>>
>> >>> --------------------------------------------------
>> >>> commit 994b87e2b4d30ac2435cff9fe20ecdfc6ab1b916 (head)
>> >>> Author: Andreas Lehmkühler <[email protected]>
>> >>> Date:   Sat Dec 6 12:32:10 2025 +0000
>> >>>
>> >>>     PDFBOX-5169: reduce the memory footprint by reusing the internal byte array instead of copying it
>> >>>
>> >>>     git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1930285 13f79535-47bb-0310-9956-ffa450edef68
>> >>> --------------------------------------------------
>> >>> After this one, the created PDF could not be rendered in Poppler.
>> >>>
>> >>> Next we have this:
>> >>> --------------------------------------------------
>> >>> commit f3bb496975ee6ca6ae98c00c0e50cfc4375a3f8a (HEAD)
>> >>> Author: Andreas Lehmkühler <[email protected]>
>> >>> Date:   Sat Jan 10 11:25:01 2026 +0000
>> >>>
>> >>>     PDFBOX-6142: take the size of the stream into account when accessing the data of the underlying byte array
>> >>>
>> >>>     git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1931215 13f79535-47bb-0310-9956-ffa450edef68
>> >>> --------------------------------------------------
>> >>> This one sometimes stores a COSDictionary instead of a COSStream for the
>> >>> contents of the document.
>> >>>
>> >>> Best regards
>> >>> Daniel
>> >>>
>> >>>
>> >>> On Tue, Feb 3, 2026 at 4:34 PM Andreas Lehmkühler <[email protected]> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>> Am 03.02.26 um 15:46 schrieb Daniel Persson:
>> >>>>> Hi again.
>> >>>>>
>> >>>>> Sorry to say that this version is still not great.
>> >>>> Thanks for the feedback
>> >>>>
>> >>>>>
>> >>>>> -1.
>> >>>>>
>> >>>>> I have not figured out what is going on, because we do a lot of operations,
>> >>>>> but when I process a file with multiple pages (48), do all our
>> >>>>> operations, and then save it again, I get a bunch of blank pages.
>> >>>>> The first 38 pages don't save a COSStream for the content stream; it uses
>> >>>>> a COSDictionary with the length and filter:
>> >>>>>
>> >>>>> Filter: FlateDecode
>> >>>>> Length: 7820
>> >>>>>
>> >>>>> So the first 38 pages are blank, and the last 10 are stored correctly.
>> >>>>> This is a change from the previous version of PDFBox.
>> >>>>>
>> >>>>> I'm trying to create a minimal reproducible example to show this issue.
>> >>>>> I'm sending this email in case someone might have an idea why I see this.
>> >>>> Is this new in 3.0.7?
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> Best regards
>> >>>>> Daniel
>> >>>>>
>> >>>>> On Mon, Feb 2, 2026 at 6:14 PM Andreas Lehmkühler <[email protected]> wrote:
>> >>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> a candidate for the PDFBox 3.0.7 release is available at:
>> >>>>>>
>> >>>>>> https://dist.apache.org/repos/dist/dev/pdfbox/3.0.7/
>> >>>>>>
>> >>>>>> The release candidate is a zip archive of the sources in:
>> >>>>>>
>> >>>>>> https://svn.apache.org/repos/asf/pdfbox/tags/3.0.7/
>> >>>>>>
>> >>>>>> The SHA-512 checksum of the archive is
>> >>>>>>
>> >>>>>> bf863c69225821d93d4a4cf86b4dae59c93211651ca72bfbf5da7dfcf6a480b3d7b8c0ea672adbba789afd0e79481ec8883da15e29c5fa31cba564aa8cfc89d0.
>> >>>>>>
>> >>>>>> Please vote on releasing this package as Apache PDFBox 3.0.7.
>> >>>>>> The vote is open for the next 72 hours and passes if a majority of at
>> >>>>>> least three +1 PDFBox PMC votes are cast.
>> >>>>>>
>> >>>>>> [ ] +1 Release this package as Apache PDFBox 3.0.7
>> >>>>>> [ ] -1 Do not release this package because...
>> >>>>>>
>> >>>>>>
>> >>>>>> Here is my +1
>> >>>>>>
>> >>>>>> Andreas
>> >>>>>>
>> >>>>>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: [email protected]
>> >>>>>> For additional commands, e-mail: [email protected]
