On 06/03/12 18:49, mehdi houshmand wrote:
> We had this exact same problem the last time you brought this issue to
> light and our approach was slightly different. Let me first ask you
> the question, are you 100% sure that fonts are the issue here?

I'm never 100% certain of anything. I suspect fonts are the issue, but
it's hard to prove. You're quite right that RIP behavior re form
XObjects could well be the problem. Until you pointed it out, I hadn't
realised the extent to which RIPs might simply assume form XObjects are
re-used across multiple pages and never free them from RAM.

> When the pdf-image-plugin is used, ALL pdf-images are imported
> wholesale, creating a new XObject Form for each page. Now, this works
> perfectly fine for smaller documents, however, it can blow the memory
> stack on RIPs for larger docs. The reason being XObjects are treated
> as global resources of the PDF, as such, it is possible to create the
> XObject and use it multiple times. However, this means that each
> XObject and its resources, are being stored in memory on the RIP.

Ugh. A well-designed RIP should be able to load XObject forms on demand
and free them under memory pressure. After all, an image is also a
global resource (an indirect object with a stream) that can be
referenced multiple times across different pages, but PDFs with large
numbers of images don't typically crash RIPs. There's no excuse for lots
of small indirect objects crashing a RIP, be they images or form
XObjects.

The actual XObject dictionary may have to stay loaded, since a form
XObject is an indirect object that any page's resources dictionary can
reference, but the XObject's own resources dictionary, content stream,
etc. certainly don't have to.
Unfortunately, that doesn't mean real-world RIPs will actually release
those resources under memory pressure just because they can. Since it's
hard to guess whether a form XObject will be referenced over and over or
used only once, this isn't that surprising.

> This is different to how a RIP can handle a /Page object. When
> printing/rendering a /Page object, the RIP only needs the page's
> content stream and any resources it references in memory. Once the
> page is rendered, the memory can be cleared.

The same is technically true of rendering a form XObject. Once you've
drawn it, you can discard its content stream from memory and discard any
resources you loaded from its resources dictionary. The trouble is that
you don't know whether you'll just be loading it again for the next
page. It'd be fairly simple to keep an LRU list of form XObjects so they
get unloaded if they're not referenced after a few pages are processed
and there's memory pressure. I wouldn't be too surprised if most RIPs
don't do this, though.
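
To make the LRU idea concrete, here's a rough sketch in Python (purely
for illustration; I'm not claiming any real RIP works this way). The
loader callback and the byte budget are hypothetical stand-ins for
"decode the form XObject" and "memory pressure".

```python
from collections import OrderedDict


class FormXObjectCache:
    """Toy LRU cache for decoded form XObjects, keyed by object number."""

    def __init__(self, loader, max_bytes):
        self._loader = loader        # hypothetical: object number -> decoded bytes
        self._max_bytes = max_bytes  # crude stand-in for "memory pressure"
        self._entries = OrderedDict()
        self._used = 0

    def get(self, obj_num):
        if obj_num in self._entries:
            # Form is being re-used: mark it most recently used.
            self._entries.move_to_end(obj_num)
            return self._entries[obj_num]
        data = self._loader(obj_num)
        self._entries[obj_num] = data
        self._used += len(data)
        # Evict the least recently used forms until we fit the budget,
        # but never evict the form we just loaded.
        while self._used > self._max_bytes and len(self._entries) > 1:
            _, old = self._entries.popitem(last=False)
            self._used -= len(old)
        return data

    def cached(self):
        """Object numbers currently held, oldest first."""
        return list(self._entries)
```

A form that's drawn once simply ages out after a few pages, while a
form that's drawn on every page keeps getting bumped to the fresh end
and stays loaded.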

> When PDFBox merges docs,
> it doesn't use the XObject Form, it does so by appending /Page
> objects. This is the solution we came to, just adding a PDFBox merger
> to the pipeline.

If you're merging documents as whole pages, where you're plucking pages
from a source document and putting them unmodified into an output
document, that's entirely practical.

If you want to use PDFs as image-like resources within a page (as I do)
then you can't just append the /Page object from the source PDF. As I
understand it (I haven't implemented this) it's necessary to:

* Extract the /Page's content stream(s) plus all the resources they
  reference
* Append the referenced resource(s) to the target page's resource
  dictionary, allocating new object numbers as you copy each resource
  and changing the target of any indirect references to match the new
  object number
* Insert the concatenated content streams from the source PDF into the
  output content stream. They must be surrounded by appropriate graphics
  state save and restore operators and any necessary scale/position
  operations to place the content where you want it.

It's a *LOT* more complicated to get right than embedding an XObject,
not least because two different source PDFs may have resource dictionary
entries with the same name, forcing you to actually parse and rewrite
the content streams to prevent resource name clashes!
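
A deliberately naive sketch of that renaming, in Python for
illustration only. The helper names and the prefix scheme are invented.
It treats all resource names as one flat namespace (real resources
dictionaries are split by type: /Font, /XObject and so on), and the
regex will happily mangle anything that looks like a name inside a
string literal or inline image data, which is exactly why a real
implementation needs a proper content stream parser.

```python
import re

# A PDF name token: "/" followed by characters that are neither
# whitespace nor delimiters.
_NAME = re.compile(rb"/([^\s/\[\]<>(){}%]+)")


def unique_names(resource_names, taken, prefix):
    """Map clashing source resource names to fresh ones.

    taken is the set of names already used on the target page; it is
    updated in place with every name this source ends up using.
    """
    mapping = {}
    for name in resource_names:
        new, n = name, 0
        while new in taken:
            n += 1
            new = "%s%s_%d" % (prefix, name, n)
        taken.add(new)
        if new != name:
            mapping[name] = new
    return mapping


def rename_in_stream(content, mapping):
    """Rewrite whole name tokens in a decoded content stream."""
    def swap(match):
        old = match.group(1).decode("latin-1")
        return b"/" + mapping.get(old, old).encode("latin-1")
    return _NAME.sub(swap, content)
```

The same mapping would also have to be applied to the keys of the
copied resources dictionary, or the renamed operands won't resolve.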

I looked at this approach a while ago in another project and ran
screaming. Form XObjects make sure the placed PDF is self-contained,
getting rid of naming clashes in the resources dictionary and ensuring
it's fairly sane to embed in another page.

Doing the above in FOP would be even worse, because FOP has its own PDF
library, so everything fop-pdf-image reads from PDFBox must be
translated into FOP PDF structures. Still, most of that is in place in
fop-pdf-image, so it *might* be worth tackling. I'm really hoping it's
not necessary, though, because merging and appending resources dicts and
content streams is *ugly* work. It could be done with a PDFBox
PDFStreamEngine, but it wouldn't be fun.

> So with that in mind, what exactly are you trying to do? Why are you
> using FOP to merge PDFs?

I'm using FOP to produce documents containing a mixture of automatically
typeset formatted text and graphics. Many of the graphics are PDF
documents, and need to be PDF documents because they contain vector
artwork and text that would lose quality and grow massively in size if
embedded in rasterised form.

I'm *NOT* trying to use FOP to concatenate PDF pages, to impose PDFs, or
any of that. It'd make very little sense to do that.

> Do you need FOP to do this work?

I need either FOP or TeX, or I need to write my own document layout
system. The latter would be insane - why implement text justification
and flow algorithms, etc, when they're already well established in FOP?

> Have you
> tried merging PDFs with PDFBox and seeing how that affects the RIP?

I haven't, and it's worth a try. It'd produce a document containing many
hundreds of small, irregularly shaped pages, as each input PDF is quite
small. It'd certainly help confirm whether the issue is XObject form
use or font duplication.

--
Craig Ringer
