I am hoping to get some time to plow through a pdf
dump utility, with special interest in the structure
components of the document. Are there any particular
pdf reference document sets for testing various parts of
pdf or Reader that I may find helpful?
I've got a bunch of these pdf files from a recent svn
checkout. Does anyone know which may have structure features
of interest?
After reading a few sentences of the spec/reference
from ADBE ( [flame] I finally used pdftotext on it and am grepping
it now for things like "logical", LOL ),
I ran a dump on the annotations from two US IRS publications,
one a 1040 tax form and the other an instruction publication,
but I have no idea what I have done or missed to further
my objectives. Again, the idea here is to get the numbers
out of the 1040 form with some association to something
of relevance ( "this is line 3 on the form" ) and, in the case
of the instructions, just dump text in logical order.
I was hoping to examine the instruction publication
for information relating various pieces of text:
is a given piece of text part of a multi-column page
that should be traversed in a zig-zag format, or
is it part of a table, which should still be read left-to-right
and then down?
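To make the two traversal orders concrete, here is a minimal Java sketch. Everything in it is hypothetical (the Chunk class, the gutter position, the y tolerance); none of it is iText API. It also assumes something else has already decided whether a region is columns or a table, which is exactly the information I am hoping the structure data can provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {

    /** A positioned piece of text: x is the left edge, y the baseline (PDF y grows upward). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;
        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Two-column "zig-zag": everything left of gutterX top to bottom, then the right column. */
    static List<String> columnOrder(List<Chunk> chunks, double gutterX) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> c.x < gutterX ? 0 : 1) // left column first
                .thenComparingDouble(c -> -c.y)                   // higher on the page first
                .thenComparingDouble(c -> c.x));
        return texts(sorted);
    }

    /** Table order: rows top to bottom, left to right within a row. */
    static List<String> rowOrder(List<Chunk> chunks, double yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                // crude row bucketing: baselines that round to the same bucket share a row
                .comparingLong((Chunk c) -> -Math.round(c.y / yTolerance))
                .thenComparingDouble(c -> c.x));
        return texts(sorted);
    }

    private static List<String> texts(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Chunk> page = Arrays.asList(
                new Chunk("A", 50, 700), new Chunk("C", 320, 700),
                new Chunk("B", 50, 680), new Chunk("D", 320, 680));
        System.out.println(columnOrder(page, 300)); // [A, B, C, D]
        System.out.println(rowOrder(page, 2.0));    // [A, C, B, D]
    }
}
```

The same four chunks come out in a different order depending on the mode, so a dump that only has coordinates cannot pick the right one without some logical hint from the document.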
My utility does this:
// PdfReflectDump is my test code
PdfReflectDump p = new PdfReflectDump(0, os);
[...]
os.print("[page" + j + "]");
PdfDictionary pd = pdfreader.getPageN(j);
// a page without an /Annots entry yields null here
PdfObject pa = pd.get(PdfName.ANNOTS);
if (pa == null) { os.print("null "); }
else { p.lut(pa, os); }
os.println();
From the tax form, I got a whole bunch of things
with a vocabulary list that boils down to things
like
( "freqassoc" is a frequency-counting perl script that
outputs a vocabulary list with occurrence counts )
$ ./myrun -read ../pdf/f.pdf -annots | sed -e 's/ /\n/g' | grep [A-Z] | freqassoc | sort -g -r | more
1125 O-cut < my output text for circular ref loop...
258 Tf
216 HeBo
42 ZaDb
21 fld
14 checkThisBox
7 this
7 isBoxChecked
7 getField
7 false
7 MainCalculation
1 Zcaron
1 Ydieresis
1 Yacute
1 Widget
1 Universal-NewswithCommPi
1 Universal-GreekwithMathPi
1 Ugrave
1 Udieresis
1 Ucircumflex
etc
But if I do this on the instruction document
and carefully look at page numbers, I get
relatively few items, and they are missing from
the pages with all the multi-column text.
It seems that some of these pages do have
multi-column and single-column or bullet-points-within-column
features, but if they are described
logically I can't figure out how to obtain
the information from here. Thanks.
$ ./myrun -read ../pdf/541.pdf -annots | more
[page1]arraysz=1
arraysz=3
0.0 0.31001 1.0 /Annot arraysz=3
0.0 0.0 0.0 www.irs.gov /URI /Link arraysz=4
200.014 131.941 274.812 120.831
[page2]arraysz=3
arraysz=3
0.0 0.31001 1.0 O-cut
arraysz=3
0.0 0.0 0.0 mailto:*[email protected] O-cut
O-cut
arraysz=4
124.1 422.944 189.564 413.184 arraysz=3
0.0 0.31001 1.0 O-cut
arraysz=3
0.0 0.0 0.0 http://www.irs.gov O-cut
O-cut
arraysz=4
59.944 342.252 103.056 332.491 arraysz=3
0.0 0.31001 1.0 O-cut
arraysz=3
0.0 0.0 0.0 http://www.irs.gov/formspubs/ O-cut
O-cut
arraysz=4
42.0 299.557 124.68 289.796
[page3]null
[page4]null
[page5]null
[page6]null
[page7]null
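For anyone without the freqassoc script, the counting step of the pipeline I used on the tax form is roughly the following Java sketch. This is not the perl script itself; the class and method names are made up for illustration, and the uppercase filter only approximates `grep [A-Z]`, which may behave differently depending on locale.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TokenFrequencySketch {

    /** Returns "count token" lines, highest count first (ties broken alphabetically). */
    static List<String> countTokens(String dump) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : dump.split("\\s+")) {
            // keep only tokens that contain at least one uppercase ASCII letter
            if (token.matches(".*[A-Z].*")) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            lines.add(e.getValue() + " " + e.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        // tiny made-up input, not real dump output
        for (String line : countTokens("Tf Tf HeBo x Tf HeBo ZaDb")) {
            System.out.println(line);
        }
    }
}
```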
Making use of the nicely formatted itext source code, I think
I can generate a list of available names, but now I am not
sure how to look for other things that may be relevant:
$ cat ../../src/core/com/lowagie/text/pdf/PdfName.java | grep final | awk '{print $5}' | more
[...] ANNOT ANTIALIAS ANNOTS AP APPDEFAULT ARTBOX [...]
BBOX [...] BLEEDBOX BLINDS BM BORDER BOUNDS BOX BS BTN BYTERANGE C C0 C1 CA ca
CALGRAY CALRGB CAPHEIGHT CATALOG CATEGORY[...]
COLLECTION COLLECTIONFIELD COLLECTIONITEM COLLECTIONSCHEMA COLLECTIONSORT
COLLECTIONSUBITEM COLUMNS CONTACTINFO CONTENT CONTENTS COORDS COUNT COURIER
[...] CROPBOX CRYPT CS D DA DATA DC DCTDECODE DECODE DECODEPARMS
DEFAULTCRYPTFILTER DEFAULTCMYK DEFAULTGRAY DEFAULTRGB DESC DESCENDANTFONTS
DESCENT DEST DESTOUTPUTPROFILE DESTS DEVICEGRAY DEVICERGB DEVICECMYK DI
DIFFERENCES DISSOLVE DIRECTION DISPLAYDOCTITLE DIV DM DOCMDP DOCOPEN DOMAIN DP
DR DS DUR DUPLEX DUPLEXFLIPSHORTEDGE DUPLEXFLIPLONGEDGE DV DW E EARLYCHANGE EF
EFF [...] ENDOFBLOCK ENDOFLINE EXTEND EXTGSTATE EXPORT EXPORTSTATE EVENT F FB
FDECODEPARMS FDF FF FFILTER FIELDS FILEATTACHMENT FILESPEC FILTER FIRST
FIRSTCHAR FIRSTPAGE FIT FITH FITV FITR FITB FITBH FITBV FITWINDOW FLAGS
FLATEDECODE FO FONT FONTBBOX FONTDESCRIPTOR FONTFILE FONTFILE2 FONTFILE3
FONTMATRIX FONTNAME FORM FORMTYPE FREETEXT FRM FS FT FULLSCREEN FUNCTION
FUNCTIONS FUNCTIONTYPE GAMMA GBK GLITTER GOTO GOTOE GOTOR GROUP GTS_PDFA1
GTS_PDFX GTS_PDFXVERSION H HEIGHT HELV HELVETICA HELVETICA_BOLD
HELVETICA_OBLIQUE HELVETICA_BOLDOBLIQUE HID HIDE HIDEMENUBAR HIDETOOLBAR
HIDEWINDOWUI HIGHLIGHT I ICCBASED ID IDENTITY IF IMAGE IMAGEB IMAGEC IMAGEI
IMAGEMASK INDEX INDEXED INFO INK [...]
JAVASCRIPT JBIG2DECODE JBIG2GLOBALS JPXDECODE JS K KEYWORDS KIDS L L2R LANG
LANGUAGE LAST LASTCHAR LASTPAGE LAUNCH LENGTH LENGTH1 LIMITS LINE LINK LISTMODE
LOCATION LOCK LOCKED [...]
MARKED MARKINFO MASK MAX MAXLEN MEDIABOX MCID MCR METADATA MIN MK MMTYPE1
MODDATE N N0 N1 N2 N3 N4 NAME NAMED NAMES NEEDAPPEARANCES NEWWINDOW NEXT
NEXTPAGE NM NONE NONFULLSCREENPAGEMODE NUMCOPIES NUMS O OBJ OBJR OBJSTM OC OCG
OCGS OCMD OCPROPERTIES Off OFF ON ONECOLUMN OPEN OPENACTION OP op OPM OPT ORDER
ORDERING OUTLINES OUTPUTCONDITION OUTPUTCONDITIONIDENTIFIER OUTPUTINTENT
OUTPUTINTENTS P PAGE [...]
PROPERTIES PS PUBSEC Q QUADPOINTS R R2L RANGE RC RBGROUPS REASON [...]
SECT SEPARATION SETOCGSTATE SHADING SHADINGTYPE SHIFT_JIS SIG SIGFLAGS SIGREF
SIMPLEX SINGLEPAGE SIZE SMASK SORT SPAN SPLIT SQUARE SQUIGGLY ST STAMP STANDARD
STATE STDCF STEMV STMF STRF STRIKEOUT STRUCTPARENT STRUCTPARENTS STRUCTTREEROOT
STYLE [...]
TWOCOLUMNLEFT TWOCOLUMNRIGHT TWOPAGELEFT TWOPAGERIGHT TX TYPE [...]
USEATTACHMENTS USENONE USEOC USEOUTLINES USER USERPROPERTIES USERUNIT USETHUMBS
V V2 VERISIGN_PPKVS VERSION VIEW VIEWAREA VIEWCLIP VIEWERPREFERENCES VIEWSTATE
VISIBLEPAGES W W2 WC WIDGET WIDTH WIDTHS WIN WIN_ANSI_ENCODING WIPE WHITEPOINT
[...]
For that matter, and to start another flame war,
what is the general approach to automated testing
of various pdf generation or manipulation code?
Can you do nightly builds and check the output
with some automated tool besides a binary file compare?
If you check in new stuff and use it to generate
pdf output, do you just do a binary check of the output
against reference output, or hope to find
deviations that are only observable by human
eyes? ( By "eyes" I'm using the term to be
specific. That is, imperceptible output differences
or differences in rendering time or file size would
not be relevant. )
[ flame- based on some results I've seen, it never has been LOL ]
[ flame- if you are generating output which is designed not
to be machine readable, how do you know if your target audience
is still happy with the result? ]
Thanks.
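To make the testing question concrete: the least I can picture beyond a raw byte compare is masking the fields that normally change on every run (the /CreationDate and /ModDate strings and the trailer /ID) and then comparing the rest. The Java sketch below is only that idea under stated assumptions, not how anyone actually tests: the class and method names are hypothetical, it assumes an unencrypted file whose info dictionary and trailer are stored as plain text with a hex-string /ID, and it is still a textual check, so it says nothing about what a human eye would notice.

```java
import java.nio.charset.StandardCharsets;

final class PdfCompareSketch {

    /** Blanks out fields that are expected to differ between runs; leaves everything else as is. */
    static String normalize(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so binary stream data survives intact
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        s = s.replaceAll("/(CreationDate|ModDate)\\s*\\([^)]*\\)", "/$1(masked)");
        s = s.replaceAll("/ID\\s*\\[\\s*<[0-9A-Fa-f]*>\\s*<[0-9A-Fa-f]*>\\s*\\]", "/ID[masked]");
        return s;
    }

    /** True when the two files match after the volatile fields are masked. */
    static boolean sameIgnoringVolatileFields(byte[] a, byte[] b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // made-up fragments standing in for two runs of the same generator
        byte[] run1 = "<</CreationDate(D:20090301120000Z)>> <</ID[<ab12><cd34>]>>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] run2 = "<</CreationDate(D:20090302130000Z)>> <</ID[<ef56><0a9b>]>>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sameIgnoringVolatileFields(run1, run2));
    }
}
```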
> Subject: Re: [iText-questions] contribution: FontReplacingPdfSmartCopy:
> duplicate TTF font subset merging and replacement
>
>
> To make sure the contribution isn't overlooked, you can post the
> patch on the SourceForge tracker:
> http://sourceforge.net/tracker/?group_id=15255&atid=315255
>
> And you can also send a scan of the signed CLA.
>
> The items in the tracker will be looked at next week.
> --
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions