Comparing PDFs

mehdi houshmand Mon, 05 Dec 2011 03:57:11 -0800

Hi,

To put it simply, I was wondering how easy/hard it would be to create
a PDF comparator? Before I go into my thoughts on this, I'd better give
the context of this issue just in case someone out there has a better
idea than mine.


I work on Apache FOP, and we use Jeremias' PDF-Image support plugin to
import PDF pages into a PDF document. This is done using PDFBox,
converting COS-level objects to their equivalents in FOP (FOP has
its own PDF library). FOP imports PDF images as /Form /XObjects and
executes the form ("/Form# Do") to render the page. If the same page
is imported several times, rather than creating a new form, the same
form is executed. However, one of our customers is asking for there to
be no duplication between imported PDF pages. They are using several
different documents, some of which have renderably (not a word, but
I'm using it) identical pages, i.e. when rendered they look the same.
As such, the "same" page can be imported as a /Form /XObject several
times, which is causing them problems. So what I want to do is create
a unique identifier (hashCode() or CRC-esque) for each page, so that
we have a mechanism that can identify identical pages.
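
As a minimal sketch of the identifier idea (not FOP or PDFBox code): the
class names PageFingerprint and FormCache are hypothetical, and the sketch
assumes the page's decoded content stream and some canonical serialization
of its resources are already available as byte arrays. It would only catch
pages whose bytes match exactly; pages that render the same from different
operators would still get different identifiers.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: derives an identifier from the bytes that make up a page.
final class PageFingerprint {

    // Hex-encoded SHA-256 over the content stream followed by the resources.
    // Each part is length-prefixed so that ("ab", "c") and ("a", "bc") differ.
    static String of(byte[] contentStream, byte[] resources) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            update(md, contentStream);
            update(md, resources);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    private static void update(MessageDigest md, byte[] part) {
        int n = part.length;
        md.update(new byte[] {(byte) (n >>> 24), (byte) (n >>> 16), (byte) (n >>> 8), (byte) n});
        md.update(part);
    }
}

// Hypothetical cache: hands back the same form name for pages with equal fingerprints.
final class FormCache {
    private final Map<String, String> formNames = new HashMap<>();

    String formFor(byte[] contentStream, byte[] resources) {
        String key = PageFingerprint.of(contentStream, resources);
        String name = formNames.get(key);
        if (name == null) {
            name = "Form" + formNames.size(); // first time this page is seen
            formNames.put(key, name);
        }
        return name;
    }
}
```

With something like this, importing a byte-identical page from a second
document would return the existing form name instead of creating a new
/Form /XObject.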

I realise this could be a huge task, and not nearly as simple as I'm
suggesting, considering there is much more to a PDF than what is
rendered. At the moment we don't care about MCIDs, the structure
tree or any PDF accessibility features, but the only thing I can think
of is creating a hashCode() and equals() method for every PD-level
object in PDFBox. If anyone has a better idea, which most likely is
any different idea, I'm all ears.

Thanks for your help, and for the continued efforts on PDFBox.

Mehdi
