adriano,

adriano wrote
> but then I would like to ask the question in more practical and
> limited-scoped terms: is it possible, by using iText (preferably) or
> another piece of Java software, to scan a PDF and at least detect the
> presence (if not retrieve) of duplicate objects?

I doubt that generic PDF libraries already offer a single ready-made method for that. They generally do provide the means to access every single object that can be reached from the current cross-references, though. Thus, you can use them, but you have to implement the comparison yourself.

You have to decide, though, what you consider a "duplicate". Do the contents have to be identical on a binary level? Or does it suffice if the contents represent identical entities? (For example, there are different ways to represent the same string or number, and the order of elements in a dictionary can be ignored.)

And do you furthermore want to do a shallow comparison or a deep one? (If an object references other objects, do those objects have to have the identical object number? Or does it suffice if those referenced objects in turn represent identical entities?)

adriano wrote
> I understand from the PDF specs (although it's not clearly stated) that a
> normal PDF document may only contain one (1) Catalog. Well, how do I check
> that the PDF does not actually contain two Catalogs? Of course this will
> hardly ever happen if the PDF was created using any honest PDF generating
> software, but what if the PDF was intentionally tampered with in order to
> cause problems to the consuming application?

As Bruno already told you, PDFs may contain multiple Catalogs, or Metadata streams, or Info dictionaries, ... Especially in the context of incremental updates this can be observed more often than not.

Concerning "although it's not clearly stated": if something is not clearly stated, you should be very sure before claiming that a document which does not comply with your interpretation actually fails to comply with the standard itself. Otherwise you can easily make a fool of yourself...

Furthermore, if you are concerned that your documents might be "tampered with in order to cause problems to the consuming application," you should harden those applications.
If you are afraid of documents tampered with during delivery to you, electronic signatures may be what you should be looking into.

Regards,
Michael

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Duplicate-indirect-objects-tp4657759p4657777.html
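
To illustrate the comparison choices raised above (binary vs. semantic equality, shallow vs. deep), here is a minimal sketch in plain Java. It is not iText code and not a finished solution: plain `Map`, `List`, `Number` and `String` values stand in for already-parsed PDF objects, and the names `DuplicateObjectSketch`, `Ref`, `dict`, `canonical`, `findDuplicates` and `sample` are all hypothetical. Stream contents and string encodings are not modelled. The idea is to reduce each object to a canonical text form (sorted dictionary keys, normalized numbers) and to group object numbers whose forms coincide.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: plain Java values stand in for parsed PDF objects (no iText calls). */
final class DuplicateObjectSketch {

    /** Hypothetical stand-in for an indirect reference such as "12 0 R". */
    static final class Ref {
        final int number;
        Ref(int number) { this.number = number; }
    }

    /** Hypothetical helper: builds a dictionary from alternating keys and values. */
    static Map<String, Object> dict(Object... keysAndValues) {
        Map<String, Object> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
            d.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return d;
    }

    /** Canonical text form: equal forms mean "same entity" under the chosen rules. */
    static String canonical(Object value, Map<Integer, Object> objects,
                            boolean deep, List<Integer> path) {
        if (value == null || value instanceof Boolean) {
            return String.valueOf(value);
        }
        if (value instanceof Map) {
            // Key order is irrelevant in a dictionary, so sort the keys first.
            StringBuilder sb = new StringBuilder("<<");
            for (Map.Entry<?, ?> e : new TreeMap<Object, Object>((Map<?, ?>) value).entrySet()) {
                sb.append(e.getKey()).append(' ')
                  .append(canonical(e.getValue(), objects, deep, path)).append(' ');
            }
            return sb.append(">>").toString();
        }
        if (value instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            for (Object item : (List<?>) value) {
                sb.append(canonical(item, objects, deep, path)).append(' ');
            }
            return sb.append("]").toString();
        }
        if (value instanceof Number) {
            // 1, 1.0 and 1.00 denote the same number.
            return new BigDecimal(value.toString()).stripTrailingZeros().toPlainString();
        }
        if (value instanceof Ref) {
            int target = ((Ref) value).number;
            if (!deep) {
                return target + " R"; // shallow: the object numbers themselves must match
            }
            int seenAt = path.indexOf(target);
            if (seenAt >= 0) {
                return "^" + (path.size() - seenAt); // cycle: relative back-reference
            }
            path.add(target);
            String form = canonical(objects.get(target), objects, deep, path);
            path.remove(path.size() - 1);
            return form; // deep: compare what the reference points to
        }
        return "(" + value + ")"; // strings; names are written as "/Name" in this sketch
    }

    /** Groups object numbers whose canonical forms coincide; singletons are dropped. */
    static List<List<Integer>> findDuplicates(Map<Integer, Object> objects, boolean deep) {
        Map<String, List<Integer>> byForm = new LinkedHashMap<>();
        for (Map.Entry<Integer, Object> e : new TreeMap<>(objects).entrySet()) {
            List<Integer> path = new ArrayList<>(Arrays.asList(e.getKey()));
            String form = canonical(e.getValue(), objects, deep, path);
            byForm.computeIfAbsent(form, k -> new ArrayList<>()).add(e.getKey());
        }
        List<List<Integer>> groups = new ArrayList<>();
        for (List<Integer> group : byForm.values()) {
            if (group.size() > 1) {
                groups.add(group);
            }
        }
        return groups;
    }

    /** Made-up sample: objects 1 and 2 differ only in key order and number notation. */
    static Map<Integer, Object> sample() {
        Map<Integer, Object> objects = new TreeMap<>();
        objects.put(1, dict("Type", "/Font", "Size", 1));
        objects.put(2, dict("Size", 1.0, "Type", "/Font"));
        objects.put(3, dict("Font", new Ref(1)));
        objects.put(4, dict("Font", new Ref(2)));
        objects.put(5, Arrays.asList(1, 2));
        return objects;
    }

    public static void main(String[] args) {
        System.out.println("shallow: " + findDuplicates(sample(), false));
        System.out.println("deep:    " + findDuplicates(sample(), true));
    }
}
```

On the made-up sample, the shallow comparison reports only objects 1 and 2 as duplicates, because objects 3 and 4 reference different object numbers. The deep comparison also pairs 3 with 4, since their referenced objects represent the same entity. With a real PDF library, the `objects` map would have to be filled by walking every entry reachable from the cross-references, and a binary-level comparison would compare raw bytes instead of canonical forms.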