[jira] [Commented] (PDFBOX-5809) PDDocument#importPage slowed down by factor 1300

Tilman Hausherr (Jira) Mon, 29 Apr 2024 04:18:04 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841950#comment-17841950
 ]


Tilman Hausherr commented on PDFBOX-5809:
-----------------------------------------

Re changing the original document: I meant changing its representation in 
memory. I want to avoid this because it would be surprising behavior. The 
physical file would only be changed if saved.

As for "why so much faster" in 2.0.X: the difference is 
{{setHighestImportedObjectNumber()}}, a call made only in 3.0.X to fix a 
nasty problem (broken documents) when combining documents. That problem 
occurs only in 3.0.X, because of the on-demand parser or because of 
ObjectStream compression when saving. The call is slow with your document 
because the annotations and the beads contain page references, which in turn 
reference other pages, so the call ends up crawling through the whole 
document.
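
To illustrate the crawl described above, here is a toy model in plain Java. 
It is not PDFBox code: the class name, the object numbering and the 
"one link annotation per page pointing at the next page" layout are all 
assumptions made for the sketch. It only shows how a traversal that starts at 
a single page can reach every object once pages reference each other.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model, not PDFBox internals.
final class ReachabilityDemo {

    /** Builds a map: object number -> object numbers it references. */
    static Map<Integer, List<Integer>> buildDocument(int pageCount, boolean withLinks) {
        Map<Integer, List<Integer>> refs = new HashMap<>();
        for (int p = 0; p < pageCount; p++) {
            int page = 3 * p + 1;     // page dictionary
            int content = 3 * p + 2;  // content stream of that page
            int annot = 3 * p + 3;    // link annotation on that page
            List<Integer> pageRefs = new ArrayList<>();
            pageRefs.add(content);
            refs.put(content, Collections.<Integer>emptyList());
            if (withLinks) {
                // The annotation points at the next page (wrapping around).
                int nextPage = 3 * ((p + 1) % pageCount) + 1;
                pageRefs.add(annot);
                refs.put(annot, Collections.singletonList(nextPage));
            }
            refs.put(page, pageRefs);
        }
        return refs;
    }

    /** Counts the objects reachable from start, including start itself. */
    static int countReachable(Map<Integer, List<Integer>> refs, int start) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            int current = todo.pop();
            if (!seen.add(current)) {
                continue; // already visited
            }
            for (int target : refs.getOrDefault(current, Collections.<Integer>emptyList())) {
                todo.push(target);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // With cross-page links, one page drags in all 3 * 1000 objects.
        System.out.println(countReachable(buildDocument(1000, true), 1));  // prints 3000
        // Without them, only the page and its content stream are visited.
        System.out.println(countReachable(buildDocument(1000, false), 1)); // prints 2
    }
}
```

In this model the work per imported page grows with the size of the whole 
document rather than with the size of the page.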

Is this likely to get fixed with the release of 3.0.3? I don't know.

Alternative: remove all annotations and page /B entries before splitting. Or 
try PDFSam which is based on a PDFBox fork.
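
A minimal sketch of the first alternative, assuming PDFBox 3.0.x on the 
classpath. The class and method names are made up for illustration. It 
removes the /Annots and /B entries from every page dictionary of the source 
document. Note that this changes the source document in memory and drops 
links and form field widgets, and I have not measured how much time it saves 
with your file.

```java
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

// Hypothetical helper, not part of PDFBox.
final class PageLinkStripper {

    /** Removes annotations and article beads from all pages of the document. */
    static void stripAnnotationsAndBeads(PDDocument source) {
        for (PDPage page : source.getPages()) {
            COSDictionary dict = page.getCOSObject();
            dict.removeItem(COSName.ANNOTS); // /Annots: annotations, may reference other pages
            dict.removeItem(COSName.B);      // /B: article beads, may reference other pages
        }
    }
}
```

Call {{PageLinkStripper.stripAnnotationsAndBeads(sourceDocument)}} once 
after loading and before the first {{importPage}} call.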

> PDDocument#importPage slowed down by factor 1300
> ------------------------------------------------
>
>                 Key: PDFBOX-5809
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5809
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: Marcus Korinth
>            Priority: Major
>             Fix For: 2.0.32, 4.0.0, 3.0.3 PDFBox
>
>         Attachments: image-2024-04-27-18-50-19-199.png
>
>
> We are using the *PDDocument#importPage* method in our own splitter, where we 
> split pages from a _SourceDocument_ to a _TargetDocument_. To do so, we 
> first extract the page using the following code:
> {code:java}
> final PDPage sourcePage = sourceDocument.getPage(pageNumber);
> {code}
> Immediately afterwards we call:
> {code:java}
> final PDPage targetPage = targetDocument.importPage(sourcePage);
> {code}
> This approach worked just fine with *pdfbox 2.0.26*.
> We decided to upgrade to version *3.0.2* since it tackles a lot of the 
> problems.
> Unfortunately the *PDDocument#importPage* method slowed down by a factor of 
> around 1300. In version *2.0.26* it took 15 ms on average. With the latest 
> *3.0.2* it takes 20000 ms on average. That is a huge deal breaker as we 
> usually have to split documents which have several thousand pages.
> Note: The same applies when using *PDDocument#addPage*.
> Note: The problem does not appear in *3.0.1*. But we can't use that since it 
> has other major problems which break our application.
> I have prepared an example document with which you can replicate the issue. 
> Due to the file size limitation I had to prepare a WeTransfer-Link for you: 
> https://we.tl/t-lfN2wz7cAs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
