Hi Craig,

Excellent!!! I think we're making some progress here!

<snip>
> Ugh. A well-designed RIP should be able to load XObject forms on demand
> and free them under memory pressure. After all, an image is also a
> global resource that can be referenced multiple times across different
> pages (an indirect object with a stream), but PDFs with large numbers of
> images don't typically crash RIPs. There's no excuse for lots of small
> indirect objects crashing a RIP, be they images or form xobjects.

The operative word there is "well-designed". I also think you're
making a lot of assumptions about how the RIP handles these objects. I
don't disagree with your assumptions; I'm just saying that you don't
know how the RIP handles these objects, so you have to be careful.

<snip/>
> The same is technically true of rendering a form XObject. Once you've
> drawn it, you can discard its content stream from memory and discard any
> resources you loaded from its resources dictionary. The trouble is that
> you don't know if you'll just be loading it again for the next page.
> It'd be fairly simple to keep a LRU list of form XObjects so they get
> unloaded if they're not referenced after a few pages are processed and
> there's memory pressure. I won't be too surprised if most RIPs don't do
> this, though.

Yeah, again, I think it's dangerous to assume that the people who
wrote the code designed it to be robust and flexible.
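
To make your LRU idea concrete, here's a minimal sketch of what such a
cache might look like. To be clear, this is not what any actual RIP does
as far as I know; the class name, the byte budget and the loader callback
are all made up for illustration.

```python
from collections import OrderedDict


class FormXObjectCache:
    """Hypothetical LRU cache of loaded form XObjects, bounded by total bytes."""

    def __init__(self, max_bytes, loader):
        self.max_bytes = max_bytes      # assumed memory budget for cached forms
        self.loader = loader            # callable: object number -> bytes
        self.used_bytes = 0
        self._entries = OrderedDict()   # object number -> bytes, oldest first

    def get(self, obj_num):
        if obj_num in self._entries:
            # Cache hit: mark this form as most recently used.
            self._entries.move_to_end(obj_num)
            return self._entries[obj_num]
        # Cache miss: load the form, then evict old ones if over budget.
        data = self.loader(obj_num)
        self._entries[obj_num] = data
        self.used_bytes += len(data)
        self._evict()
        return data

    def _evict(self):
        # Drop least recently used forms, but always keep the newest entry.
        while self.used_bytes > self.max_bytes and len(self._entries) > 1:
            _, data = self._entries.popitem(last=False)
            self.used_bytes -= len(data)


# Usage: three 40-byte forms against a 100-byte budget.
loads = []

def load(obj_num):
    loads.append(obj_num)               # record every real load
    return b"x" * 40

cache = FormXObjectCache(max_bytes=100, loader=load)
for obj_num in (1, 2, 1, 3, 1, 2):
    cache.get(obj_num)
```

Form 1 is referenced on every "page" so it stays resident, while forms 2
and 3 push each other out once the budget is exceeded.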

<snip>
> If you want to use PDFs as image-like resources within a page (as I do)
> then you can't just append the /Page object from the source PDF. As I
> understand it (I haven't implemented this) it's necessary to:
>
> * Extract the /Page's content stream(s) plus all resources referenced
> * Append the referenced resource(s) to the target page's resource
> dictionary, allocating new object numbers as you copy a resource and
> changing the target of any indirect references to match the new object
> number
> * Insert the concatenated content streams from the source PDF into the
> output content stream. They must be surrounded by appropriate graphics
> state save and restore operators and any necessary scale/position
> operations to place the content where you want it.

HA HA!! Incorrect! If you look into the nooks and crannies of the PDF
spec, you'll see that a /Page's /Contents entry can be an array of
content streams rather than a single stream. I'll leave exploring that
to you, but basically it makes overlaying pages much, much simpler. In
related news, PDFBox does just that!! What we did (and it's a super
hack, but it worked) was this: if there were pages with both PDF-image
content and FOP-generated content, we'd get FOP to generate the content
without the PDF-image and then just overlay the pages. Best of both
worlds!! (Though the purist in me is very much aggrieved.)
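
Here's a rough sketch of the idea, modelling the /Contents array as a
plain Python list of stream bodies. It isn't our actual code or the
PDFBox API; the function name and the sample operators are made up, and
it assumes the two pages' /Resources have already been merged without
name clashes.

```python
def overlay_contents(base_streams, overlay_streams, matrix=(1, 0, 0, 1, 0, 0)):
    """Build a /Contents array that draws the overlay on top of the base page.

    Each page's streams are left untouched; small extra streams holding
    q / Q (save / restore graphics state) and an optional cm (transform)
    are placed around them.
    """
    cm = " ".join(f"{v:g}" for v in matrix) + " cm\n"
    return (
        [b"q\n"] + list(base_streams) + [b"\nQ\n"]
        + [b"q\n" + cm.encode("ascii")] + list(overlay_streams) + [b"\nQ\n"]
    )


# Usage: FOP-style text page plus a half-size "PDF-image" placed at (72, 72).
contents = overlay_contents(
    [b"BT /F1 12 Tf (text) Tj ET"],
    [b"/Im0 Do"],
    matrix=(0.5, 0, 0, 0.5, 72, 72),
)

# A consumer treats the array as if the streams were concatenated in order.
page_stream = b"".join(contents)
```

The point is that nothing has to be decoded, concatenated or re-encoded:
the existing streams are simply referenced from one array.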

Ok, so maybe I'll add some transparency as to how we came to some of
these decisions. The client told us that PDFs of ~16k pages with
6-8k XObjects (I *heart* grep) were disproportionately slow and that
fonts were to blame, so obviously that's where we started. I managed
to do some font de-duping of Type1 fonts (seeing as FOP doesn't subset
these). It was horrendous and the fidelity was terrible, but I was just
experimenting. This made some impact, but not enough. So after some
more experimentation proved fonts weren't to blame, we had to step
back and look at the problem again. We also found out that the RIP
times didn't scale linearly with the size of the document, i.e. if x
pages took y time, 2x pages took 10y time (if that makes sense). This
made us think it was a memory issue: somehow the RIP's memory was
filling up. A lot of faffing about later, we got to the conclusions
I've described.
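
To put a rough number on that scaling: if you assume RIP time grows as
pages^k (a power law is purely my assumption here, and the 2x / 10y
figures are illustrative rather than measured), you can back out k from
two data points. The function name is made up.

```python
import math


def scaling_exponent(pages_ratio, time_ratio):
    """Exponent k in time ~ pages**k, estimated from two runs."""
    return math.log(time_ratio) / math.log(pages_ratio)


# Doubling the pages made the job take ten times as long.
k = scaling_exponent(2, 10)

# For comparison, linear behaviour (2x pages, 2x time) gives k = 1.
k_linear = scaling_exponent(2, 2)
```

With those figures k comes out a little over 3, which is nowhere near
the k = 1 you'd expect if each page cost the same.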

The more you describe your problem, the more it sounds like you need
to do exactly what we did, but just to be sure, I thought I'd explain
how we got there. Assumptions are a dangerous thing and I've probably
made some about your issue too.

Hopefully we can get to some resolution about this soon,

Mehdi
