[iText-questions] PDF "philosophy" (was RE: modifed sample, question on PDF contents)

Leonard Rosenthol Tue, 10 Mar 2009 06:26:40 -0700

BMP (which isn't an open standard, of course) is just a raster image - like 
JPEG, PNG, TIFF, etc.  It doesn't have any concept of text or vector elements - 
let alone any other type of content...


SVG is a much better analogy - especially since SVG is derived from PGML which 
was created by Adobe to represent the "PDF imaging model" in XML. In both SVG 
and (untagged) PDF you have specific graphical elements with explicit (either 
absolute or relative) positioning on a "canvas" - no concept of how these 
elements are organized.  

To draw a string in SVG at 10,10, it's "<text transform="matrix(1 0 0 1 10 10)">Some 
text goes here</text>".  In PDF, it's "1 0 0 1 10 10 Tm (Some text goes here) 
Tj".  Looks similar - as noted above, it should!
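
To make the parallel concrete, here is a minimal sketch in Python that emits 
both forms from the same matrix and string. The function names are invented 
for illustration and are not part of iText or any other library. It assumes 
the PDF operators sit inside a BT/ET text object and that a font has already 
been selected with Tf (omitted here), and it uses SVG's matrix(...) transform 
syntax. Note that the two formats place the origin differently (PDF at the 
bottom-left with y pointing up, SVG at the top-left with y pointing down), so 
the same numbers do not land on the same spot of the page.

```python
from xml.sax.saxutils import escape


def pdf_literal(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    for ch in ("\\", "(", ")"):  # backslash first, so added escapes survive
        text = text.replace(ch, "\\" + ch)
    return "(" + text + ")"


def pdf_text_ops(text, matrix=(1, 0, 0, 1, 10, 10)):
    """Content-stream operators that position and show one string.

    Hypothetical helper: a real page also needs a font set with Tf
    somewhere between BT and Tj.
    """
    nums = " ".join(str(n) for n in matrix)
    return "BT %s Tm %s Tj ET" % (nums, pdf_literal(text))


def svg_text(text, matrix=(1, 0, 0, 1, 10, 10)):
    """The equivalent SVG text element, with XML-escaped content."""
    nums = " ".join(str(n) for n in matrix)
    return '<text transform="matrix(%s)">%s</text>' % (nums, escape(text))


print(pdf_text_ops("Some text goes here"))
print(svg_text("Some text goes here"))
```

Both functions take the same six numbers, which is the whole point: the 
positioning model is shared, only the surface syntax differs.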

In Tagged PDF, these elements can be grouped together into logical blocks, such 
as "/H1 BMC 1 0 0 1 10 10 Tm (Some text goes here) Tj EMC". In this example, 
I made that text an "H1" (aka Heading Level 1, just like HTML).   So the syntax 
is different, but the concepts are the same.
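
As a rough sketch of that grouping (again in Python, with a made-up helper 
name), the tag goes in front of the BMC operator as a name operand and EMC 
closes the sequence. The second form, BDC with an /MCID property, is the one 
a structure tree entry can point back to. This only builds the content-stream 
text; a conforming Tagged PDF also needs the structure tree itself in the 
document catalog, which is left out here. The tag is assumed to be a plain 
name with no characters that need escaping.

```python
def marked_content(tag, ops, mcid=None):
    """Wrap content-stream operators in a marked-content sequence.

    Hypothetical helper for illustration only.
    tag  -- structure type such as "H1" (assumed to be a simple PDF name)
    ops  -- the operators to enclose, e.g. a complete BT ... ET text object
    mcid -- optional marked-content identifier; when given, the BDC form
            with an /MCID property dictionary is emitted instead of BMC
    """
    if mcid is None:
        return "/%s BMC %s EMC" % (tag, ops)
    return "/%s <</MCID %d>> BDC %s EMC" % (tag, mcid, ops)


heading = "BT 1 0 0 1 10 10 Tm (Some text goes here) Tj ET"
print(marked_content("H1", heading))
print(marked_content("H1", heading, mcid=0))
```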

So why use PDF over SVG - many reasons.  The biggest technical reason is 
multiple pages!  SVG is a single page format, while PDF supports multiple. But 
the main reasons are practical - all of the major authoring tools support PDF 
and not SVG - and 99% of the world has a PDF viewer on their computer/phone but 
not an SVG viewer.

>Well, the FDA publishes clinical trial data for approved drugs in formats that 
>include scanned PDF files, which are pretty much
>useless for any real analysis by outside entities even with decent OCR 
>software. 
>
That's usually because that is how the information is received from the drug 
company.  The FDA doesn't require "computer readable" information and so drug 
companies aren't going to "give away" their hard earned information if they 
don't have to.  

>The FCC, last time I looked, even accepts submissions that disallow extraction 
>of images or text. 
>
I'd be surprised if that were the case - but I haven't looked recently either...

>
> And what types of "manipulation" are you expecting? Some documents aren't 
> designed for manipulation, such as the plans for a Sherman Tank - while 
> others, such as forms make sense to enable extraction and processing of the 
> data.

My "plans for a Sherman Tank" example is, believe it or not, a REAL PDF that I 
have seen at the DOD.  Companies such as Boeing and Airbus also publish PDF 
manuals for every plane they build - with full technical drawings of 
each part.  So no - not a flippant example, but a real and true one.  However, 
I agree with you that such information needs to be both human and computer 
readable - which is why PDF supports BOTH rich rendering AND rich semantics for 
all forms of content.  In fact, it's the ONLY format that supports both!  (Yes, 
PDF allows structure and metadata to be incorporated for vector and even 3D 
information!) 

>I'd like to be able to maintain my own tax information and
>extract it from a filled in 1040 and not just waste time typing
>into an information black hole in some proprietary or unworkable
>format. 
>
        PDF isn't a proprietary format - it's an open international standard 
(ISO 32000-1).  Can't get more "non-proprietary" than that!!

        But on the more general issue, what you are running into are decisions 
by the government that they can (and do!) make $$ selling the tax tables - and 
as such, there is no incentive for them to put that information into a format 
that "just anyone" can access.  However, if you do license the information from 
them - you get it in machine readable format.  That's capitalism - not 
a technical limitation ;).


Leonard


-----Original Message-----
From: Mike Marchywka [mailto:marchy...@hotmail.com] 
Sent: Tuesday, March 10, 2009 8:13 AM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] modifed sample, question on PDF contents



As a newcomer to the list I'm not sure how apropos this
is but until I hear otherwise I'll assume it is ok.
This is probably more political than itext relevant.

----------------------------------------
> From: lrose...@adobe.com
> To: itext-questions@lists.sourceforge.net
> Date: Tue, 10 Mar 2009 04:34:57 -0700
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
> You need to consider the history of PDF...
>
> The original design was for "electronic paper" - something where you could 
> create a "frozen instance" of your document that would look the same on any 
> computer and print as it looked. As such, there was no need to incorporate 
> semantic information about the structure of the document - only information 
> necessary to render it.

Isn't this what a BMP file is (LOL)? I have to admit that
my experience with Reader 7 on Win 2K and other attributes
of the format left me searching for any other alternatives.
Every time I say or write "PDF" I still think of scanned
documents that look like they came in over a FAX machine. 

I guess a more appropriate comparison, rather than BMP,
could be your SVG approach - all you have here is glyphs
instead of shapes. For artwork or pictures, this is fine but
not for information that is more accurately textual. 
When would someone decide to publish a PDF file instead of
an SVG "document?"


>
> However, as the use of PDF developed it became clear that there was a need to 
> also incorporate structural/semantic information to be able to make use of 
> the content in a consistent fashion (vs. having to "guess", and everyone 
> guessing differently) and thus the tagging/structure features were added in 
> PDF 1.4. Unfortunately, not all PDF producers will put such information into 
> the file :(. Like any format, "garbage in, garbage out".
>
> What type of government documents are you talking about? Different 
> departments create different types of documents, and those, of course, vary 
> country to country. Consider in the USA, you have tax forms from the IRS, 
> transcripts from Congress, technical materials from the DOD, etc.

Well, the FDA publishes clinical trial data for approved drugs
in formats that include scanned PDF files, which are pretty much
useless for any real analysis by outside entities even with decent
OCR software. The FCC, last time I looked, even accepts submissions
that disallow extraction of images or text. Fortunately I 
haven't seen a PDF submission in the SEC company filings in a long
time and they have even gone to XBRL XML filings. 

Computers may be able to automate data processing, not just 
remove information. A recent summary of my attitude, with
limited references, is here, buried in with some other topics, if you are 
interested:

http://www.sec.gov/comments/s7-04-09/s70409-2.pdf

[ note that I did not submit this as a PDF file, LOL  ] 


>
> And what types of "manipulation" are you expecting? Some documents aren't 
> designed for manipulation, such as the plans for a Sherman Tank - while 
> others, such as forms make sense to enable extraction and processing of the 
> data.

While I'm sure this is just a flippant example (as I often
give, LOL), it does illustrate this presumption that
people need or want pictures/limited data, not robust model 
information, when
in fact the opposite would be true with this example. 
You might want to restrict access but this is actually a 
perfect example of where you NEED automated interaction with
information and pictures/views/renderings are really not 
the main issue. An image document like PDF or a 
screen shot from a CAD system
is not what you want for storing and manipulating plans.
"Plans" would require even more versatile machine
readability with human readability being just a small component.
Presumably, you would like to archive, manipulate, and reuse
pieces and partially assembled units and make these things
automatically from the plans. At minimum, something like
a CNC mill or automated material ordering system would have to
"read" the plans. 

The US IRS offers PDF tax forms.
I'd like to be able to maintain my own tax information and
extract it from a filled in 1040 and not just waste time typing
into an information black hole in some proprietary or unworkable
format. Taxes are mostly numbers, and numbers can be manipulated
for many purposes if not buried in a bunch of irrelevant formatting
information. I'd probably cry if I found out the IRS bought
special scanner equipment and high-speed printers to print electronic
submissions only so they could be scanned back in just because
the PDF format doesn't let them separate information from graphics.
But, I also would not be surprised if that is exactly what they do.




------------------------------------------------------------------------------
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
