Re: [iText-questions] question regarding reference PDF's and general testing approach and contributions

Leonard Rosenthol Sun, 29 Mar 2009 06:58:00 -0700

It's in MAJOR need of updating (a project which has been on my list for too 
long), but you are welcome to use any of the files at 
<http://acroeng.adobe.com>.  This is NOT an officially supported collection of 
test files - just some stuff that we've put online for our own internal 
testing...


You did notice that the PDFReference/ISO 32000 has bookmarks and a table of 
contents that make navigating it VERY easy, right?? 

From the strange dumps below, it looks like you are trying to treat the PDF 
file as a text file - and thus are only getting PDF object information that is 
uncompressed.  The actual page contents (which may or may not be grep-pable 
anyway) are compressed in a stream object.  Also, text is really a set of 
drawing instructions that may or may not be in any logical order.  A read of 
the graphics and text sections of the PDFRef/ISO32K will improve your 
understanding here.
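
To illustrate the compression point, here is a minimal sketch using only the JDK's java.util.zip classes. It assumes the page content is stored with the common /FlateDecode filter (zlib/deflate); the class name, the helper names and the sample content string are made up for the illustration and are not part of iText or of any real file. It shows that the text operators cannot be found in the stored bytes until the stream has been inflated.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateStreamDemo {

    // Compress bytes into zlib format, which is what a /FlateDecode stream holds.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib data back into the original content-stream bytes.
    static byte[] inflate(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input: stop instead of looping
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Rough equivalent of grep: does the byte sequence contain the needle?
    static boolean containsText(byte[] data, String needle) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) {
        // Hypothetical page content: text is a drawing instruction (Tj), not a paragraph.
        byte[] raw = "BT /F1 12 Tf 72 700 Td (Line 3 wages) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] stored = deflate(raw);
        System.out.println("found in stored bytes:   " + containsText(stored, "Line 3"));
        System.out.println("found in inflated bytes: " + containsText(inflate(stored), "Line 3"));
    }
}
```

Even after inflating, the string is only visible this easily when the font uses a simple encoding; that part is outside the sketch.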

Automated testing of PDF generation/processing is a VERY long-standing debate, 
for many reasons ranging from WHAT you are trying to compare (PDF objects, page 
content, etc.) to HOW such comparisons should be done.  There are some tools 
that can be used, such as jPDFUnit, ImageMagick/GraphicsMagick, etc.  But you 
first need to define WHAT you are trying to accomplish with automated testing.
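
As one example of defining the WHAT, here is a rough sketch of comparing rendered pages rather than PDF bytes. It assumes each page has already been rasterized to an image by some external tool (not shown), and that "same" means "no pixel differs by more than a chosen tolerance". The class name, method names and the tolerance are assumptions for the illustration, not the API of any of the tools named above.

```java
import java.awt.image.BufferedImage;

public class PageImageDiff {

    // Largest per-channel difference (0-255) between two packed RGB pixels.
    static int channelDelta(int p, int q) {
        int dr = Math.abs(((p >> 16) & 0xFF) - ((q >> 16) & 0xFF));
        int dg = Math.abs(((p >> 8) & 0xFF) - ((q >> 8) & 0xFF));
        int db = Math.abs((p & 0xFF) - (q & 0xFF));
        return Math.max(dr, Math.max(dg, db));
    }

    // Fraction of pixels (0.0 to 1.0) that differ by more than the tolerance.
    static double diffRatio(BufferedImage a, BufferedImage b, int tolerance) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return 1.0; // different page size: treat as completely different
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (channelDelta(a.getRGB(x, y), b.getRGB(x, y)) > tolerance) {
                    differing++;
                }
            }
        }
        return (double) differing / ((long) a.getWidth() * a.getHeight());
    }

    public static void main(String[] args) {
        // Two stand-in "rendered pages": identical except for one pixel.
        BufferedImage reference = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        BufferedImage candidate = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        candidate.setRGB(0, 0, 0xFFFFFF);
        System.out.println("differing fraction: " + diffRatio(reference, candidate, 0));
    }
}
```

A check like this ignores file size, object layout and metadata entirely, which is exactly why the goal has to be chosen before the tool.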

Leonard

-----Original Message-----
From: Mike Marchywka [mailto:[email protected]] 
Sent: Saturday, March 28, 2009 10:30 AM
To: [email protected]
Subject: [iText-questions] question regarding reference PDF's and general 
testing approach and contributions


I'm hoping to get some time to plow through a pdf
dump utility with special interest in the structure
components of the document. Are there any particular
pdf reference document sets for testing various parts of
pdf or Reader that I may find helpful? 
I've got a bunch of these pdf files from a recent svn
checkout. Anyone know which may have structure features
of interest? 

After reading a few sentences of the spec/reference
from ADBE ([flame] I finally used pdftotext on it and am grepping
it now for things like "logical", LOL),
I ran a dump on the annotations from two US IRS publications,
one a 1040 tax form and the other an instruction publication,
but I have no idea what I have done or missed to further
my objectives. Again, the idea here is to get the numbers
out of the 1040 form with some association to something
of relevance ("this is line 3 on the form") and, in the case
of the instructions, just dump text in logical order.
I was hoping to examine the instruction publication
for information relating various pieces of text:
is a given piece of text part of a multi-column page
that should be traversed in a zig-zag format, or
is it part of a table, which should still be read left-to-right
and then down?
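
The two traversal orders in question can be sketched as two sort orders over positioned text fragments. This is only an illustration under stated assumptions: the Chunk type, the fixed column width and the exact-match row coordinates are hypothetical, and nothing here comes from iText or from the IRS files. Coordinates follow the PDF convention of y growing upward.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    // Hypothetical positioned text fragment (x grows rightward, y grows upward).
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    static List<String> texts(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    // Multi-column ("zig-zag") order: read a whole column top to bottom,
    // then move to the next column to the right.
    static List<String> columnOrder(List<Chunk> chunks, double columnWidth) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> (int) Math.floor(c.x / columnWidth))
                .thenComparing(Comparator.comparingDouble((Chunk c) -> c.y).reversed()));
        return texts(sorted);
    }

    // Table order: read each row left to right, rows from top to bottom.
    // Real data would need a tolerance when grouping y values into rows.
    static List<String> rowOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed()
                .thenComparingDouble(c -> c.x));
        return texts(sorted);
    }

    public static void main(String[] args) {
        List<Chunk> page = Arrays.asList(
                new Chunk("A", 50, 700), new Chunk("B", 300, 700),
                new Chunk("C", 50, 680), new Chunk("D", 300, 680));
        System.out.println("as columns: " + columnOrder(page, 250));
        System.out.println("as table:   " + rowOrder(page));
    }
}
```

The same four fragments come out in a different order depending on which layout is assumed, so the positions alone do not say which order is right; that has to come from structure information in the file or from a heuristic.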


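My utility does this:
// PdfReflectDump is my test code
PdfReflectDump p = new PdfReflectDump(0, os);
[...]
os.print("[page" + j + "]");
PdfDictionary pd = pdfreader.getPageN(j);  // page dictionary for page j
PdfObject pa = pd.get(PdfName.ANNOTS);     // the page's /Annots entry, if any
if (pa == null) { os.print("null "); }     // page has no annotations
else { p.lut(pa, os); }                    // dump the annotation object
os.println();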
                

From the tax form, I got a whole bunch of things
with a vocabulary list that boils down to things
like 
("freqassoc" is a frequency-counting perl script that
outputs a vocabulary list with occurrence counts) 

$ ./myrun -read ../pdf/f.pdf -annots | sed -e 's/ /\n/g' | grep [A-Z] | 
freqassoc | sort -g -r | more

1125 O-cut < my output text for circular ref loop... 
258 Tf
216 HeBo
42 ZaDb
21 fld
14 checkThisBox
7 this
7 isBoxChecked
7 getField
7 false
7 MainCalculation
1 Zcaron
1 Ydieresis
1 Yacute
1 Widget
1 Universal-NewswithCommPi
1 Universal-GreekwithMathPi
1 Ugrave
1 Udieresis
1 Ucircumflex
 
etc

But if I do this on the instruction document
and carefully look at page numbers, I get
relatively few items, and they are lacking from
the pages with all the multi-column text.
It seems that some of these pages do have
multi-column and single-column or bullet-point
features within a column, but if they are described
logically I can't figure out how to obtain
the information from here. Thanks.

$ ./myrun -read ../pdf/541.pdf -annots  | more


[page1]arraysz=1 
arraysz=3 
 0.0  0.31001  1.0  /Annot  arraysz=3 
 0.0  0.0  0.0   www.irs.gov  /URI    /Link  arraysz=4 
 200.014  131.941  274.812  120.831  
[page2]arraysz=3 
arraysz=3 
 0.0  0.31001  1.0  O-cut 
 arraysz=3 
 0.0  0.0  0.0   mailto:*[email protected]  O-cut 
   O-cut 
 arraysz=4 
 124.1  422.944  189.564  413.184  arraysz=3 
 0.0  0.31001  1.0  O-cut 
 arraysz=3 
 0.0  0.0  0.0   http://www.irs.gov  O-cut 
   O-cut 
 arraysz=4 
 59.944  342.252  103.056  332.491  arraysz=3 
 0.0  0.31001  1.0  O-cut 
 arraysz=3 
 0.0  0.0  0.0   http://www.irs.gov/formspubs/  O-cut 
   O-cut 
 arraysz=4 
 42.0  299.557  124.68  289.796  
[page3]null 
[page4]null 
[page5]null 
[page6]null 
[page7]null 

Making use of the nicely formatted itext source code, I think
I can generate a list of available names, but now I'm not
sure how to look for other things that may be relevant:

$ cat ../../src/core/com/lowagie/text/pdf/PdfName.java | grep final | awk '{print $5}' | more

[...] ANNOT ANTIALIAS ANNOTS AP APPDEFAULT ARTBOX [...]
 BBOX [...] BLEEDBOX BLINDS BM BORDER BOUNDS BOX BS BTN BYTERANGE C C0 C1 CA ca 
CALGRAY CALRGB CAPHEIGHT CATALOG CATEGORY[...]
COLLECTION COLLECTIONFIELD COLLECTIONITEM COLLECTIONSCHEMA COLLECTIONSORT 
COLLECTIONSUBITEM COLUMNS CONTACTINFO CONTENT CONTENTS COORDS COUNT COURIER 
[...] CROPBOX CRYPT CS D DA DATA DC DCTDECODE DECODE DECODEPARMS 
DEFAULTCRYPTFILTER DEFAULTCMYK DEFAULTGRAY DEFAULTRGB DESC DESCENDANTFONTS 
DESCENT DEST DESTOUTPUTPROFILE DESTS DEVICEGRAY DEVICERGB DEVICECMYK DI 
DIFFERENCES DISSOLVE DIRECTION DISPLAYDOCTITLE DIV DM DOCMDP DOCOPEN DOMAIN DP 
DR DS DUR DUPLEX DUPLEXFLIPSHORTEDGE DUPLEXFLIPLONGEDGE DV DW E EARLYCHANGE EF 
EFF [...] ENDOFBLOCK ENDOFLINE EXTEND EXTGSTATE EXPORT EXPORTSTATE EVENT F FB 
FDECODEPARMS FDF FF FFILTER FIELDS FILEATTACHMENT FILESPEC FILTER FIRST 
FIRSTCHAR FIRSTPAGE FIT FITH FITV FITR FITB FITBH FITBV FITWINDOW FLAGS 
FLATEDECODE FO FONT FONTBBOX FONTDESCRIPTOR FONTFILE FONTFILE2 FONTFILE3 
FONTMATRIX FONTNAME FORM FORMTYPE FREETEXT FRM FS FT FULLSCREEN FUNCTION 
FUNCTIONS FUNCTIONTYPE GAMMA GBK GLITTER GOTO GOTOE GOTOR GROUP GTS_PDFA1 
GTS_PDFX GTS_PDFXVERSION H HEIGHT HELV HELVETICA HELVETICA_BOLD 
HELVETICA_OBLIQUE HELVETICA_BOLDOBLIQUE HID HIDE HIDEMENUBAR HIDETOOLBAR 
HIDEWINDOWUI HIGHLIGHT I ICCBASED ID IDENTITY IF IMAGE IMAGEB IMAGEC IMAGEI 
IMAGEMASK INDEX INDEXED INFO INK [...]
JAVASCRIPT JBIG2DECODE JBIG2GLOBALS JPXDECODE JS K KEYWORDS KIDS L L2R LANG 
LANGUAGE LAST LASTCHAR LASTPAGE LAUNCH LENGTH LENGTH1 LIMITS LINE LINK LISTMODE 
LOCATION LOCK LOCKED [...]
 MARKED MARKINFO MASK MAX MAXLEN MEDIABOX MCID MCR METADATA MIN MK MMTYPE1 
MODDATE N N0 N1 N2 N3 N4 NAME NAMED NAMES NEEDAPPEARANCES NEWWINDOW NEXT 
NEXTPAGE NM NONE NONFULLSCREENPAGEMODE NUMCOPIES NUMS O OBJ OBJR OBJSTM OC OCG 
OCGS OCMD OCPROPERTIES Off OFF ON ONECOLUMN OPEN OPENACTION OP op OPM OPT ORDER 
ORDERING OUTLINES OUTPUTCONDITION OUTPUTCONDITIONIDENTIFIER OUTPUTINTENT 
OUTPUTINTENTS P PAGE [...]
PROPERTIES PS PUBSEC Q QUADPOINTS R R2L RANGE RC RBGROUPS REASON [...]
SECT SEPARATION SETOCGSTATE SHADING SHADINGTYPE SHIFT_JIS SIG SIGFLAGS SIGREF 
SIMPLEX SINGLEPAGE SIZE SMASK SORT SPAN SPLIT SQUARE SQUIGGLY ST STAMP STANDARD 
STATE STDCF STEMV STMF STRF STRIKEOUT STRUCTPARENT STRUCTPARENTS STRUCTTREEROOT 
STYLE [...]
 TWOCOLUMNLEFT TWOCOLUMNRIGHT TWOPAGELEFT TWOPAGERIGHT TX TYPE [...]
USEATTACHMENTS USENONE USEOC USEOUTLINES USER USERPROPERTIES USERUNIT USETHUMBS 
V V2 VERISIGN_PPKVS VERSION VIEW VIEWAREA VIEWCLIP VIEWERPREFERENCES VIEWSTATE 
VISIBLEPAGES W W2 WC WIDGET WIDTH WIDTHS WIN WIN_ANSI_ENCODING WIPE WHITEPOINT 
[...]


For that matter, and to start another flame war,
what is the general approach to automated testing
of various pdf generation or manipulation code?
Can you do nightly builds and check the output
with some automated tool besides a binary file compare?
If you check in new stuff and use it to generate
pdf output, do you just do a binary check of the output
against reference output, or hope to find
deviations that are only observable by human
eyes? (By "eyes" I'm using the term to be
specific. That is, imperceptible output differences
or differences in rendering time or file size would
not be relevant.)
[ flame- based on some results I've seen, it never has been LOL ]
[ flame- if you are generating output which is designed not
to be machine readable, how do you know if your target audience
is still happy with the result? ]

Thanks.



> Subject: Re: [iText-questions] contribution: FontReplacingPdfSmartCopy:
> duplicate TTF font subset merging and replacement
>
>
> To make sure the contribution isn't overlooked, you can post the
> patch on the SourceForge tracker:
> http://sourceforge.net/tracker/?group_id=15255&atid=315255
>
> And you can also send a scan of the signed CLA.
>
> The items in the tracker will be looked at next week.
> --


------------------------------------------------------------------------------
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
