DO NOT REPLY [Bug 51664] [PATCH] Tagged PDF performance improvement + tests

bugzilla Tue, 16 Aug 2011 07:39:47 -0700

https://issues.apache.org/bugzilla/show_bug.cgi?id=51664


--- Comment #3 from Jeremias Maerki <[email protected]> 2011-08-16 14:39:23 UTC ---
Just some background on the problem:

It was found that enabling accessibility (tagged PDF) decreases PDF production
performance considerably.

I've profiled FOP with an FO file (about 10 pages). I ran both FO->PDF and
FO->IF->PDF scenarios to isolate the bulk of the "lost" time. It turns out that
the FO->IF stage doesn't lose much performance to the additional work, so I
concentrated on IF->PDF.

The VisualVM profiler highlighted PDFDocument.getWriterFor() and
BufferedOutputStream.flush() as hot spots in the accessibility case. Most of
that is caused by PDFDictionary, PDFArray and PDFName. The heavy weight on
dictionaries and arrays is actually expected, since Tagged PDF structures are
all dictionaries and arrays. Lots of them.

Look at the PDF sizes:
- Normal PDF: 105 KB (65 PDF Objects)
- Tagged PDF: 868 KB (6462 PDF Objects)

That's A LOT of additional content: all dictionaries and arrays, which cannot
be compressed (in PDF 1.4). That also means a big increase in output I/O. So
it's in the nature of tagged PDF that it must be considerably slower.

What I've tried now is to address the hot spot I found above. I got rid of the
Writers for encoding text output. Instead I switched to a StringBuilder that is
flushed to the OutputStream when necessary. That decreases the average
processing time after warm-up (IF->PDF case) from 775ms to 460ms (normal PDF
from 355ms to 325ms). Measured as the remaining tagged PDF penalty (the extra
time a tagged PDF takes over a normal PDF), that is:

(460 - 325) / (775 - 355) = 135 / 420 = 0.32 (-68%)
So it cuts the tagged PDF penalty to a third.
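
To illustrate the idea (this is not the code from the patch), here is a minimal
sketch of buffering text output in a StringBuilder and flushing it to the
OutputStream only when necessary, for example before binary data is written
directly to the stream. The class and method names are hypothetical, and the
ISO-8859-1 encoding is an assumption for the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of FOP: buffers PDF text in a StringBuilder. */
final class TextBufferSketch {

    /** Writes any pending text to the stream and empties the buffer. */
    static void flushTextBuffer(StringBuilder textBuffer, OutputStream out)
            throws IOException {
        if (textBuffer.length() > 0) {
            // Assumption: ISO-8859-1, which maps chars 0-255 to single bytes 1:1.
            out.write(textBuffer.toString().getBytes(StandardCharsets.ISO_8859_1));
            textBuffer.setLength(0);
        }
    }

    /** Emits a small dictionary followed by raw bytes to show where a flush is needed. */
    static byte[] demo() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        StringBuilder textBuffer = new StringBuilder();
        // Text fragments only touch the in-memory buffer, not the stream.
        textBuffer.append("<< /Type /StructElem");
        textBuffer.append(" /S /P >>\n");
        // Binary data goes straight to the stream, so pending text is flushed first.
        flushTextBuffer(textBuffer, out);
        out.write(new byte[] {(byte) 0xE2, (byte) 0xE3});
        return out.toByteArray();
    }
}
```

The point of the sketch is that many small append() calls stay in memory and
reach the stream as one write, instead of each going through a Writer.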

That was the IF->PDF case. Here are the measurements for the FO->PDF case (the
same test document):

normal PDF: 772ms --> 712ms
tagged PDF: 1472ms --> 1042ms

normal PDF: 712 / 772 = 0.92 (-8%)
tagged PDF: 1042 / 1472 = 0.71 (-29%)
tagged PDF penalty: (1042 - 712) / (1472 - 772) = 330 / 700 = 0.47 (-53%)
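
For anyone who wants to re-run the numbers with their own timings, the penalty
ratio used in both cases can be sketched as follows. The class and method names
are made up for illustration; only the formula comes from the figures above.

```java
/** Hypothetical helper reproducing the penalty arithmetic from this comment. */
final class PenaltySketch {

    /**
     * Tagged PDF penalty after the change divided by the penalty before it.
     * The penalty is the extra time a tagged PDF takes over a normal PDF.
     */
    static double penaltyRatio(double normalBefore, double taggedBefore,
                               double normalAfter, double taggedAfter) {
        double penaltyBefore = taggedBefore - normalBefore;
        double penaltyAfter = taggedAfter - normalAfter;
        return penaltyAfter / penaltyBefore;
    }
}
```

Calling penaltyRatio(355, 775, 325, 460) gives the IF->PDF ratio and
penaltyRatio(772, 1472, 712, 1042) gives the FO->PDF ratio.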

There's a catch: This optimization requires a backwards-incompatible change in
the PDF library. The PDFWritable interface changes from
  void outputInline(OutputStream out, Writer writer) throws IOException;
to
  void outputInline(OutputStream out, StringBuilder textBuffer) throws IOException;

The same applies to PDFObject.formatObject(). Both are very central parts of
the PDF library. The change could invalidate pending patches or private
additions from third parties. But it doesn't seem to be easy enough to write
adapter code to work around this.
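
To give an idea of what the change means for code that implements the
interface, here is a rough sketch. PDFWritableSketch is a local stand-in that
only carries the new signature quoted above, and CustomNameSketch is a made-up
third-party class; neither is the actual FOP code.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Local stand-in with only the new signature; not the real FOP interface. */
interface PDFWritableSketch {
    void outputInline(OutputStream out, StringBuilder textBuffer) throws IOException;
}

/** Hypothetical third-party object that used to write its text to the Writer. */
final class CustomNameSketch implements PDFWritableSketch {
    private final String name;

    CustomNameSketch(String name) {
        this.name = name;
    }

    @Override
    public void outputInline(OutputStream out, StringBuilder textBuffer)
            throws IOException {
        // With the old signature this would have been: writer.write("/" + name);
        textBuffer.append('/').append(name);
    }
}
```

In this simple case the migration is mechanical: text that used to go to the
Writer is appended to the shared buffer, and the OutputStream is left alone.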

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
