adriano,

adriano wrote
> I am not referring to regular PDF documents, but intentionally altered
> ones made up by the bad guys in order to try and cause problems to some
> application. I am aware that a PDF document may have more than one
> /Catalog if it has been revised.
> So what I was asking about is how to find suspicious objects in a PDF,
> like e.g. seemingly unused or duplicated ones of certain types (like a
> double /Catalog when the document has no revisions) ....

I think you should first clarify what you expect of incoming PDFs. Without that, there can be no concept of suspicious objects, as documents in the wild may contain anything.

E.g. as part of some use case, your PDFs might be generated or finally manipulated by only one program. That program may always do its job in a certain way that can be recognized in the resulting PDFs. Your expectation would then be that such patterns can be recognized. (These patterns do depend on the very program, though!)

As soon as that's done, you should define your term "suspicious objects" more clearly than your "e.g. ... like ...". If, as speculated above, you can expect the PDFs to expose certain patterns, suspicious objects would be those which break those patterns. But be aware that such an analysis requires a fairly homogeneous document source (or at least a small collection of such sources).

Alternatively, as you constantly mention manipulated PDFs, you might already have been receiving such bogus documents. If you have multiple such manipulated PDFs, you can analyze them and try to find manipulation patterns in them. These patterns should definitely stand out from the multitude of correct input documents, though; otherwise you'll get too many false suspects.

If you really are trying to harden some process in which there is a good likelihood of such manipulations in transport, you IMO should consider introducing electronic signatures (in a broad sense; i.e. as long as it is secure, anything goes, and it does not necessarily have to involve legally backed qualified signatures) and reject any input without a signature or with a broken signature.

Regards, Michael

PS: In general, multiple objects of a type a document needs only one of are not suspicious, even if there aren't multiple visible revisions.
Some programs change PDFs by inserting new and changed objects before the cross reference table and updating that table to now represent the new and changed objects. Thus, no visible revisions but still those objects you consider suspicious...

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Duplicate-indirect-objects-tp4657759p4657797.html
Sent from the iText - General mailing list archive at Nabble.com.
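
To make the PS a bit more concrete, here is a rough sketch of how one might at least list the candidates: it scans the raw bytes for "N G obj" headers, groups them by object number and generation, notes which objects carry /Type /Catalog, and counts startxref keywords as a crude hint at the number of revisions. It is plain Python and does not use iText; the function name scan_pdf_bytes and the report keys are made up for illustration. It is only a heuristic: it does not look into object streams, and it may match text inside stream data. As said above, a hit is not proof of manipulation, only something to compare against the patterns you expect.

```python
import re
from collections import defaultdict

# "N G obj" header at a token boundary (not preceded by another digit).
OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")
CATALOG = re.compile(rb"/Type\s*/Catalog\b")


def scan_pdf_bytes(data: bytes) -> dict:
    """Heuristic raw scan (hypothetical helper, not an iText API).

    Objects inside compressed object streams are not seen, and byte
    sequences inside stream data may be mistaken for object headers.
    """
    offsets = defaultdict(list)  # (object number, generation) -> byte offsets
    catalogs = []                # one entry per definition containing /Type /Catalog
    for m in OBJ_HEADER.finditer(data):
        key = (int(m.group(1)), int(m.group(2)))
        offsets[key].append(m.start())
        # Look at the bytes up to the next "endobj" (or end of file).
        end = data.find(b"endobj", m.end())
        body = data[m.end(): end if end != -1 else len(data)]
        if CATALOG.search(body):
            catalogs.append(key)
    return {
        "duplicates": {k: v for k, v in offsets.items() if len(v) > 1},
        "catalogs": catalogs,
        # Each incremental update normally appends one more "startxref".
        "startxref_count": data.count(b"startxref"),
    }


# Tiny hand-made sample: object 1 0 is defined twice, but there is only
# one startxref, i.e. no visible revision (the case described in the PS).
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /Lang (en) >>\nendobj\n"
    b"xref\n0 3\n"  # cross-reference entries omitted; the scan ignores them
    b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n0\n%%EOF\n"
)
report = scan_pdf_bytes(sample)
```

For a real file you would pass the result of open(path, "rb").read(). Whether a duplicate reported this way is harmless (a program rewriting objects in place) or hostile still depends on what your document sources normally produce.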