adriano,

adriano wrote
> but then I would like to ask the question in more practical and
> limited-scoped terms: is it possible, by using iText (preferably) or
> another piece of Java software, to scan a PDF and at least detect the
> presence (if not retrieve) of duplicate objects?

I doubt that general-purpose PDF libraries already offer a ready-made
method for that. They generally do provide the means to access every
object that can be reached from the current cross references, though.
So you can use them, but you have to implement the comparison yourself.
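
A minimal sketch of such a do-it-yourself comparison on the binary level,
assuming you have already extracted the serialized bytes of each indirect
object into a map keyed by object number. How you fill that map depends on
your library and is left out here (in iText 5, as far as I know,
PdfReader.getXrefSize() and PdfReader.getPdfObject(int) are the accessors
to start from). The class and method names below are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of iText or any other library.
final class BinaryDuplicateFinder {

    /**
     * Groups object numbers by the SHA-256 digest of their bytes and
     * returns only the groups that contain more than one object.
     */
    static List<List<Integer>> findDuplicates(Map<Integer, byte[]> objects) {
        Map<String, List<Integer>> byDigest = new LinkedHashMap<>();
        for (Map.Entry<Integer, byte[]> e : objects.entrySet()) {
            byDigest.computeIfAbsent(digest(e.getValue()), k -> new ArrayList<>())
                    .add(e.getKey());
        }
        List<List<Integer>> duplicates = new ArrayList<>();
        for (List<Integer> group : byDigest.values()) {
            if (group.size() > 1) {
                duplicates.add(group);
            }
        }
        return duplicates;
    }

    // Hex-encoded SHA-256; equal digests are treated as equal content.
    private static String digest(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Calling BinaryDuplicateFinder.findDuplicates(map) returns lists of object
numbers whose bytes are identical. This only catches byte-for-byte copies.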

You have to decide, though, what counts as a "duplicate". Do the
contents have to be identical on a binary level? Or does it suffice if the
contents represent identical entities? (For example, there are different
ways to represent the same string or number, and the order of entries in a
dictionary can be ignored.) Furthermore, do you want a shallow comparison
or a deep one? (If an object references other objects, do those
references have to point to the same object number? Or does it suffice if
the referenced objects in turn represent identical entities?)
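
To make these choices concrete, here is a rough sketch of a canonical form
for already parsed objects. It assumes a simplified, hypothetical model
that is not any library's API: dictionaries are java.util.Map, arrays are
java.util.List, numbers are java.lang.Number, references are the Ref class
below, and everything else is compared by its text. Treating 1 and 1.0 as
the same number is likewise an assumption. Two objects then count as
duplicates if their canonical strings are equal.

```java
import java.math.BigDecimal;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model, for illustration only.
final class CanonicalForm {

    /** Stand-in for an indirect reference such as "12 0 R". */
    static final class Ref {
        final int number;
        Ref(int number) { this.number = number; }
    }

    /**
     * Builds a canonical string for a value.
     * table maps object numbers to objects; it is consulted only if deep is true.
     */
    static String of(Object value, Map<Integer, Object> table, boolean deep) {
        return canon(value, table, deep, new HashSet<>());
    }

    private static String canon(Object v, Map<Integer, Object> table,
                                boolean deep, Set<Integer> open) {
        if (v instanceof Ref) {
            int n = ((Ref) v).number;
            // Shallow mode, or a cycle in deep mode: keep the object number.
            if (!deep || !open.add(n)) {
                return "ref(" + n + ")";
            }
            String resolved = canon(table.get(n), table, true, open);
            open.remove(n);
            return resolved;
        }
        if (v instanceof Map) {
            // Sorting the keys makes the entry order irrelevant.
            TreeMap<String, String> sorted = new TreeMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) v).entrySet()) {
                sorted.put(String.valueOf(e.getKey()),
                           canon(e.getValue(), table, deep, open));
            }
            return "dict" + sorted;
        }
        if (v instanceof List) {
            StringBuilder sb = new StringBuilder("array[");
            for (Object item : (List<?>) v) {
                sb.append(canon(item, table, deep, open)).append(',');
            }
            return sb.append(']').toString();
        }
        if (v instanceof Number) {
            // 1, 1.0 and 1.00 all collapse to the same text.
            return "num(" + new BigDecimal(v.toString())
                    .stripTrailingZeros().toPlainString() + ")";
        }
        return "str(" + v + ")";
    }
}
```

With deep set to false, two dictionaries pointing to objects 5 and 6 differ
even if 5 and 6 have equal contents; with deep set to true they match. The
cycle fallback and the unescaped string output are shortcuts of this
sketch, so cyclic structures and strings containing delimiters need more
care in real code.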


adriano wrote
> I understand from the PDF specs (although it's not clearly stated) that a
> normal PDF document may only contain one (1) Catalog. Well, how do I check
> that the PDF does not actually contain two Catalogs? Of course this will
> hardly ever happen if the PDF was created using any honest PDF generating
> software, but what if the PDF was intentionally tampered with in order to
> cause problems to the consuming application?

As Bruno already told you, a PDF may contain multiple Catalogs, Metadata
streams, Info dictionaries, and so on. With incremental updates you will
see this more often than not: an update appends new versions of any
changed objects, and the older versions remain in the file.
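
As a rough way to see whether a file has probably been updated
incrementally, you can count the "%%EOF" markers in its raw bytes, since
each appended update ends with its own marker. This is only a heuristic
sketch with a hypothetical class name: the byte sequence may also occur
inside stream data, and linearized files can contain more than one marker
without any update.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class RevisionHint {

    /** Counts non-overlapping occurrences of "%%EOF" in the raw file bytes. */
    static int countEofMarkers(byte[] pdf) {
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        int count = 0;
        for (int i = 0; i + marker.length <= pdf.length; i++) {
            int j = 0;
            while (j < marker.length && pdf[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                count++;
                i += marker.length - 1; // continue after this marker
            }
        }
        return count;
    }
}
```

A count above one suggests that older object versions, possibly including
an older Catalog, are still present in the file.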

Concerning "although it's not clearly stated": if something is not
clearly stated, you should be very sure before claiming that a document
which does not match your interpretation actually violates the standard
itself. Otherwise you can easily make a fool of yourself...

Furthermore, if you are concerned that your documents might be "tampered
with in order to cause problems to the consuming application," you should
harden those applications. If you are afraid that documents are tampered
with on their way to you, electronic signatures may be what you should
look into.

Regards,   Michael



--
View this message in context: 
http://itext-general.2136553.n4.nabble.com/Duplicate-indirect-objects-tp4657759p4657777.html
Sent from the iText - General mailing list archive at Nabble.com.