It's in MAJOR need of updating (a project which has been on my list for too long), but you are welcome to use any of the files at <http://acroeng.adobe.com>. This is NOT an officially supported collection of test files - just some stuff that we've put online for our own internal testing...
You did notice that the PDFReference/ISO 32000 has bookmarks and a table of contents that make navigating it VERY easy, right??

From the strange dumps below, it looks like you are trying to treat the PDF file as a text file, and thus are only getting the PDF object information that is uncompressed. The actual page contents (which may or may not be grep-able anyway) are compressed in a stream object. Also, text is really a set of drawing instructions that may or may not be in any logical order. A read of the graphics and text sections of the PDFRef/ISO32K will improve your understanding here.

The issue of automated testing of PDF generation/processing is a VERY long-standing debate, for many reasons ranging from WHAT you are trying to compare (PDF objects, page content, etc.) to HOW such comparisons should be done. There are some tools that can be used, such as jPDFUnit, ImageMagick/GraphicsMagick, etc. But you first need to define WHAT you are trying to accomplish with automated testing.

Leonard

-----Original Message-----
From: Mike Marchywka [mailto:[email protected]]
Sent: Saturday, March 28, 2009 10:30 AM
To: [email protected]
Subject: [iText-questions] question regarding reference PDF's and general testing approach and contributions

I am hoping to get some time to plow through a pdf dump utility, with special interest in the structure components of the document. Are there any particular pdf reference document sets for testing various parts of pdf or Reader that I may find helpful? I've got a bunch of these pdf files from a recent svn checkout. Anyone know which may have structure features of interest?

After reading a few sentences of the spec/reference from ADBE ([flame] I finally used pdftotext on it and am grepping it now for things like "logical", LOL), I ran a dump on the annotations from two US IRS publications, one a 1040 tax form and the other an instruction publication, but I have no idea what I have done or missed to further my objectives.
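On the point in the reply above that page contents sit in compressed stream objects: here is a minimal sketch, in plain Java using only `java.util.zip`, of why grepping the raw file finds so little. It assumes the common case of a stream using the /FlateDecode filter, whose body is zlib-compressed. The class name `FlateStreamSketch` and the sample content stream are made up for illustration; predictors, filter chains and encryption are ignored.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of iText.
final class FlateStreamSketch {

    // Compress bytes into zlib format, the form a /FlateDecode stream body takes.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Recover the drawing instructions from a zlib-compressed stream body.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                out.write(buf, 0, n);
                // No progress and not finished: the data is truncated or needs a preset dictionary.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up content stream: text is shown by operators such as Tf, Td and Tj.
        byte[] content = "BT /F1 12 Tf 72 720 Td (Line 3 of the form) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] stored = deflate(content); // roughly what sits between "stream" and "endstream"
        System.out.println(new String(inflate(stored), StandardCharsets.US_ASCII));
    }
}
```

With iText itself you would not normally decode by hand: `PdfReader.getPageContent(int)` should hand back the already-decoded content bytes of a page, though check that against the version you have checked out.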
Again, the idea here is to get the numbers out of the 1040 form with some association to something of relevance ("this is line 3 on the form") and, in the case of the instructions, to just dump text in logical order. I was hoping to examine the instruction publication for information relating various pieces of text: is a given piece of text part of a multi-column page that should be traversed in a zig-zag format, or is it part of a table, which should still be read left-to-right and then down?

My utility does this,

    // reflect dump is my test code
    PdfReflectDump p = new PdfReflectDump(0, os);
    [...]
    os.print("[page" + j + "]");
    PdfDictionary pd = pdfreader.getPageN(j);
    PdfObject pa = pd.get(PdfName.ANNOTS);
    if (pa == null) {
        os.print("null ");
    } else {
        p.lut(pa, os);
    }
    os.println();

From the tax form, I got a whole bunch of things with a vocabulary list that boils down to things like the following ("freqassoc" is a frequency-counting perl script that outputs a vocabulary list with occurrence counts):

$ ./myrun -read ../pdf/f.pdf -annots | sed -e 's/ /\n/g' | grep [A-Z] | freqassoc | sort -g -r | more
1125 O-cut    < my output text for circular ref loop...
258 Tf
216 HeBo
42 ZaDb
21 fld
14 checkThisBox
7 this
7 isBoxChecked
7 getField
7 false
7 MainCalculation
1 Zcaron
1 Ydieresis
1 Yacute
1 Widget
1 Universal-NewswithCommPi
1 Universal-GreekwithMathPi
1 Ugrave
1 Udieresis
1 Ucircumflex
etc

but if I do this on the instruction document and carefully look at page numbers, I get relatively few items, and they are lacking from the pages with all the multi-column text. It seems that some of these pages do have multi-column and single-column or bullet-point-within-column features, but if they are described logically I can't figure out how to obtain the information from here.

Thanks.
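For anyone wanting to reproduce the vocabulary count above without the perl script, here is a rough Java sketch of the same pipeline: split on whitespace, keep the tokens of interest, count, sort by count descending. `freqassoc` itself is not shown in this thread, so this is a guess at its behaviour, and the class name `VocabCount` is made up. One assumption to flag: the listing above contains all-lowercase tokens such as `fld`, so the filter here keeps any token containing an ASCII letter rather than only uppercase ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the sed | grep | freqassoc | sort pipeline.
final class VocabCount {

    // True if the token contains at least one ASCII letter (assumed filter, see lead-in).
    static boolean hasLetter(String tok) {
        return tok.chars().anyMatch(c -> (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
    }

    // Count tokens and return them sorted by count (descending), then by token.
    static List<Map.Entry<String, Integer>> count(String dump) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : dump.trim().split("\\s+")) {
            if (tok.isEmpty() || !hasLetter(tok)) {
                continue; // drops bare numbers such as 0.31001
            }
            counts.merge(tok, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up fragment in the style of the annotation dump.
        String dump = "arraysz=3 0.0 0.31001 1.0 /Annot /URI /Link arraysz=4 /Link";
        for (Map.Entry<String, Integer> e : count(dump)) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
    }
}
```

Note that a count like this only describes what the annotation dictionaries contain. It says nothing about page text, which lives in the content streams.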
$ ./myrun -read ../pdf/541.pdf -annots | more
[page1]arraysz=1 arraysz=3 0.0 0.31001 1.0 /Annot arraysz=3 0.0 0.0 0.0 www.irs.gov /URI /Link arraysz=4 200.014 131.941 274.812 120.831
[page2]arraysz=3 arraysz=3 0.0 0.31001 1.0 O-cut arraysz=3 0.0 0.0 0.0 mailto:*[email protected] O-cut O-cut arraysz=4 124.1 422.944 189.564 413.184 arraysz=3 0.0 0.31001 1.0 O-cut arraysz=3 0.0 0.0 0.0 http://www.irs.gov O-cut O-cut arraysz=4 59.944 342.252 103.056 332.491 arraysz=3 0.0 0.31001 1.0 O-cut arraysz=3 0.0 0.0 0.0 http://www.irs.gov/formspubs/ O-cut O-cut arraysz=4 42.0 299.557 124.68 289.796
[page3]null
[page4]null
[page5]null
[page6]null
[page7]null

Making use of the nicely formatted itext source code, I think I can generate a list of available names but now not sure how to look for other things that may be relevant,

$ cat ../../src/core/com/lowagie/text/pdf/PdfName.java | grep final | awk '{print $5}' | more

[...] ANNOT ANTIALIAS ANNOTS AP APPDEFAULT ARTBOX [...] BBOX [...] BLEEDBOX BLINDS BM BORDER BOUNDS BOX BS BTN BYTERANGE C C0 C1 CA ca CALGRAY CALRGB CAPHEIGHT CATALOG CATEGORY [...] COLLECTION COLLECTIONFIELD COLLECTIONITEM COLLECTIONSCHEMA COLLECTIONSORT COLLECTIONSUBITEM COLUMNS CONTACTINFO CONTENT CONTENTS COORDS COUNT COURIER [...] CROPBOX CRYPT CS D DA DATA DC DCTDECODE DECODE DECODEPARMS DEFAULTCRYPTFILTER DEFAULTCMYK DEFAULTGRAY DEFAULTRGB DESC DESCENDANTFONTS DESCENT DEST DESTOUTPUTPROFILE DESTS DEVICEGRAY DEVICERGB DEVICECMYK DI DIFFERENCES DISSOLVE DIRECTION DISPLAYDOCTITLE DIV DM DOCMDP DOCOPEN DOMAIN DP DR DS DUR DUPLEX DUPLEXFLIPSHORTEDGE DUPLEXFLIPLONGEDGE DV DW E EARLYCHANGE EF EFF [...]
ENDOFBLOCK ENDOFLINE EXTEND EXTGSTATE EXPORT EXPORTSTATE EVENT F FB FDECODEPARMS FDF FF FFILTER FIELDS FILEATTACHMENT FILESPEC FILTER FIRST FIRSTCHAR FIRSTPAGE FIT FITH FITV FITR FITB FITBH FITBV FITWINDOW FLAGS FLATEDECODE FO FONT FONTBBOX FONTDESCRIPTOR FONTFILE FONTFILE2 FONTFILE3 FONTMATRIX FONTNAME FORM FORMTYPE FREETEXT FRM FS FT FULLSCREEN FUNCTION FUNCTIONS FUNCTIONTYPE GAMMA GBK GLITTER GOTO GOTOE GOTOR GROUP GTS_PDFA1 GTS_PDFX GTS_PDFXVERSION H HEIGHT HELV HELVETICA HELVETICA_BOLD HELVETICA_OBLIQUE HELVETICA_BOLDOBLIQUE HID HIDE HIDEMENUBAR HIDETOOLBAR HIDEWINDOWUI HIGHLIGHT I ICCBASED ID IDENTITY IF IMAGE IMAGEB IMAGEC IMAGEI IMAGEMASK INDEX INDEXED INFO INK [...] JAVASCRIPT JBIG2DECODE JBIG2GLOBALS JPXDECODE JS K KEYWORDS KIDS L L2R LANG LANGUAGE LAST LASTCHAR LASTPAGE LAUNCH LENGTH LENGTH1 LIMITS LINE LINK LISTMODE LOCATION LOCK LOCKED [...] MARKED MARKINFO MASK MAX MAXLEN MEDIABOX MCID MCR METADATA MIN MK MMTYPE1 MODDATE N N0 N1 N2 N3 N4 NAME NAMED NAMES NEEDAPPEARANCES NEWWINDOW NEXT NEXTPAGE NM NONE NONFULLSCREENPAGEMODE NUMCOPIES NUMS O OBJ OBJR OBJSTM OC OCG OCGS OCMD OCPROPERTIES Off OFF ON ONECOLUMN OPEN OPENACTION OP op OPM OPT ORDER ORDERING OUTLINES OUTPUTCONDITION OUTPUTCONDITIONIDENTIFIER OUTPUTINTENT OUTPUTINTENTS P PAGE [...] PROPERTIES PS PUBSEC Q QUADPOINTS R R2L RANGE RC RBGROUPS REASON [...] SECT SEPARATION SETOCGSTATE SHADING SHADINGTYPE SHIFT_JIS SIG SIGFLAGS SIGREF SIMPLEX SINGLEPAGE SIZE SMASK SORT SPAN SPLIT SQUARE SQUIGGLY ST STAMP STANDARD STATE STDCF STEMV STMF STRF STRIKEOUT STRUCTPARENT STRUCTPARENTS STRUCTTREEROOT STYLE [...] TWOCOLUMNLEFT TWOCOLUMNRIGHT TWOPAGELEFT TWOPAGERIGHT TX TYPE [...] USEATTACHMENTS USENONE USEOC USEOUTLINES USER USERPROPERTIES USERUNIT USETHUMBS V V2 VERISIGN_PPKVS VERSION VIEW VIEWAREA VIEWCLIP VIEWERPREFERENCES VIEWSTATE VISIBLEPAGES W W2 WC WIDGET WIDTH WIDTHS WIN WIN_ANSI_ENCODING WIPE WHITEPOINT [...] 
For that matter, and to start another flame war, what is the general approach to automated testing of various pdf generation or manipulation code? Can you do nightly builds and check the output with some automated tool besides a binary file compare? If you check in new stuff and use it to generate pdf output, do you just do a binary check of the output against reference output, or hope to find deviations that are only observable by human eyes? (By "eyes" I'm using the term to be specific. That is, imperceptible output differences, or differences in rendering time or file size, would not be relevant.)

[ flame- based on some results I've seen, it never has been LOL ]
[ flame- if you are generating output which is designed not to be machine readable, how do you know if your target audience is still happy with the result? ]

Thanks.

> Subject: Re: [iText-questions] contribution: FontReplacingPdfSmartCopy:
> duplicate TTF font subset merging and replacement
>
> To make sure the contribution isn't overlooked, you can post the
> patch on the SourceForge tracker:
> http://sourceforge.net/tracker/?group_id=15255&atid=315255
>
> And you can also send a scan of the signed CLA.
>
> The items in the tracker will be looked at next week.
> --
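On the testing question above (binary compare versus differences a human eye would notice), one option that sits between the two is to rasterize each page with an external tool and compare the images with a tolerance. Below is a minimal sketch of only the comparison step, in plain Java. It assumes the reference and candidate pages have already been rendered to images of the same size by something like ImageMagick or GraphicsMagick; the class name `PageImageDiff`, the tolerance and any pass/fail threshold are made-up illustration values, not a recommendation from this thread.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for a nightly-build check; rendering the PDF pages is out of scope.
final class PageImageDiff {

    // Largest absolute difference over the red, green and blue channels of two pixels.
    static int maxChannelDiff(int p, int q) {
        int dr = Math.abs(((p >> 16) & 0xFF) - ((q >> 16) & 0xFF));
        int dg = Math.abs(((p >> 8) & 0xFF) - ((q >> 8) & 0xFF));
        int db = Math.abs((p & 0xFF) - (q & 0xFF));
        return Math.max(dr, Math.max(dg, db));
    }

    // Fraction of pixels (0.0 to 1.0) differing by more than the per-channel tolerance.
    // Images of different dimensions are treated as completely different.
    static double diffFraction(BufferedImage a, BufferedImage b, int tolerance) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return 1.0;
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (maxChannelDiff(a.getRGB(x, y), b.getRGB(x, y)) > tolerance) {
                    differing++;
                }
            }
        }
        return (double) differing / ((long) a.getWidth() * a.getHeight());
    }
}
```

A nightly job could then fail a page when, say, `diffFraction(reference, candidate, 8)` exceeds some small threshold. This only answers the "does it look the same" question. It says nothing about object structure, file size or whether the text is still extractable, so WHAT is being compared still has to be decided first.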
------------------------------------------------------------------------------
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
