BMP (which isn't an open standard, of course) is just a raster image - like JPEG, PNG, TIFF, etc. It doesn't have any concept of text or vector elements - let alone any other type of content...
SVG is a much better analogy - especially since SVG is derived from PGML, which was created by Adobe to represent the "PDF imaging model" in XML. In both SVG and (untagged) PDF you have specific graphical elements with explicit (either absolute or relative) positioning on a "canvas" - with no concept of how these elements are organized.

To draw a string in SVG at 10,10, it's '<text transform="matrix(1 0 0 1 10 10)">Some text goes here</text>'. In PDF, it's "BT 1 0 0 1 10 10 Tm (Some text goes here) Tj ET". Looks similar - as noted above, it should! In Tagged PDF, these elements can be grouped together into logical blocks, such as "/H1 BMC BT 1 0 0 1 10 10 Tm (Some text goes here) Tj ET EMC". In this example, I made that text an "H1" (aka Heading Level 1, just like HTML). So the syntax is different, but the concepts are the same.

So why use PDF over SVG? Many reasons. The biggest technical reason is multiple pages! SVG is a single-page format, while PDF supports multiple. But the main reasons are practical - all of the major authoring tools support PDF and not SVG - and 99% of the world has a PDF viewer on their computer/phone but not an SVG viewer.

>Well, the FDA publishes clinical trial data for approved drugs in formats that
>include scanned PDF files, which are pretty much
>useless for any real analysis by outside entities even with decent OCR
>software.
>

That's usually because that is how the information is received from the drug company. The FDA doesn't require "computer readable" information and so drug companies aren't going to "give away" their hard-earned information if they don't have to.

>The FCC, last time I looked, even accepts submissions that disallow extraction
>of images or text.
>

I'd be surprised if that were the case - but I haven't looked recently either...

> > And what types of "manipulation" are you expecting?
> > Some documents aren't designed for manipulation, such as the plans for a
> > Sherman Tank - while others, such as forms make sense to enable extraction
> > and processing of the data.

My "plans for a Sherman Tank" example is, believe it or not, a REAL PDF that I have seen at the DOD. Also, companies such as Boeing and Airbus produce manuals in PDF for every plane they build - with full technical drawings of each part. So no - not a flippant example, but a real and true one.

However, I agree with you that such information needs to be both human and computer readable - which is why PDF supports BOTH rich rendering AND rich semantics for all forms of content. In fact, it's the ONLY format that supports both! (Yes, PDF supports structure and metadata for vector and even 3D information to be incorporated!)

>I'd like to be able to maintain my own tax information and
>extract it from a filled in 1040 and not just waste time typing
>into an information black hole in some proprietary or unworkable
>format.
>

PDF isn't a proprietary format - it's an open international standard (ISO 32000-1). Can't get more "non-proprietary" than that!!

But on the more general issue, what you are running into are decisions by the government that they can (and do!) make $$ selling the tax tables - and as such, there is no incentive for them to put that information into a format that "just anyone" can access. However, if you do license the information from them - you get it in machine readable format. That's capitalism - not a technical issue ;).

Leonard

-----Original Message-----
From: Mike Marchywka [mailto:marchy...@hotmail.com]
Sent: Tuesday, March 10, 2009 8:13 AM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] modifed sample, question on PDF contents

As a newcomer to the list I'm not sure how apropos this is but until I hear otherwise I'll assume it is ok. This is probably more political than itext relevant.
----------------------------------------
> From: lrose...@adobe.com
> To: itext-questions@lists.sourceforge.net
> Date: Tue, 10 Mar 2009 04:34:57 -0700
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
> You need to consider the history of PDF...
>
> The original design was for "electronic paper" - something where you could
> create a "frozen instance" of your document that would look the same on any
> computer and print as it looked. As such, there was no need to incorporate
> semantic information about the structure of the document - only information
> necessary to render it.

Isn't this what a BMP file is (LOL)? I have to admit that my experience with Reader 7 on Win 2K and other attributes of the format left me searching for any other alternatives. Every time I say or write "PDF" I still think of scanned documents that look like they came in over a FAX machine. I guess a more appropriate comparison, rather than BMP, could be your SVG approach - all you have here is glyphs instead of shapes. For artwork or pictures, this is fine but not for information that is more accurately textual. When would someone decide to publish a PDF file instead of an SVG "document"?

> However, as the use of PDF developed it became clear that there was a need to
> also incorporate structural/semantic information to be able to make use of
> the content in a consistent fashion (vs. having to "guess", and everyone
> guessing differently) and thus the tagging/structure features were added in
> PDF 1.4. Unfortunately, not all PDF producers will put such information into
> the file :(. Like any format, "garbage in, garbage out".
>
> What type of government documents are you talking about? Different
> departments create different types of documents, and those, of course, vary
> country to country. Consider in the USA, you have tax forms from the IRS,
> transcripts from Congress, technical materials from the DOD, etc.
Well, the FDA publishes clinical trial data for approved drugs in formats that include scanned PDF files, which are pretty much useless for any real analysis by outside entities even with decent OCR software. The FCC, last time I looked, even accepts submissions that disallow extraction of images or text. Fortunately I haven't seen a PDF submission in the SEC company filings in a long time and they have even gone to XBRL XML filings. Computers may be able to automate data processing, not just remove information.

A recent summary of my attitude, with limited references, is here, buried in with some other topics, if you are interested:
http://www.sec.gov/comments/s7-04-09/s70409-2.pdf
[ note that I did not submit this as a PDF file, LOL ]

> And what types of "manipulation" are you expecting? Some documents aren't
> designed for manipulation, such as the plans for a Sherman Tank - while
> others, such as forms make sense to enable extraction and processing of the
> data.

While I'm sure this is just a flippant example (as I often give, LOL), it does illustrate this presumption that people need or want pictures/limited data, not robust model information, when in fact the opposite would be true with this example. You might want to restrict access, but this is actually a perfect example of where you NEED automated interaction with information, and pictures/views/renderings are really not the main issue. An image document like PDF or a screen shot from a CAD system is not what you want for storing and manipulating plans. "Plans" would require even more versatile machine readability, with human readability being just a small component. Presumably, you would like to archive, manipulate, and reuse pieces and partially assembled units and make these things automatically from the plans. At minimum, something like a CNC mill or automated material ordering system would have to "read" the plans.

The US IRS offers PDF tax forms.
I'd like to be able to maintain my own tax information and extract it from a filled in 1040 and not just waste time typing into an information black hole in some proprietary or unworkable format. Taxes are mostly numbers, and numbers can be manipulated for many purposes if not buried in a bunch of irrelevant formatting information. I'd probably cry if I found out the IRS bought special scanner equipment and high-speed printers to print electronic submissions only so they could be scanned back in just because the PDF format doesn't let them separate information from graphics. But, I also would not be surprised if that is exactly what they do.
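
For anyone who wants to see the SVG/PDF comparison from Leonard's reply side by side, here is a rough Python sketch that builds both strings. The helper names (svg_text, pdf_literal, pdf_text) and the /F1 font resource are made up for illustration and are not part of iText or any PDF library. It also uses a bare BMC/EMC pair to keep the picture simple; as far as I understand, a real Tagged PDF would normally use BDC with an MCID plus a structure tree, so treat this only as a picture of the operator syntax, not as a way to produce a valid tagged file.

```python
from xml.sax.saxutils import escape


def svg_text(text, x, y):
    """Return an SVG <text> element placed with an explicit matrix transform."""
    return f'<text transform="matrix(1 0 0 1 {x} {y})">{escape(text)}</text>'


def pdf_literal(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    # The backslash must be handled first so later escapes are not doubled.
    for ch in ("\\", "(", ")"):
        text = text.replace(ch, "\\" + ch)
    return f"({text})"


def pdf_text(text, x, y, tag=None, font="/F1", size=12):
    """Return content-stream operators that show `text` at (x, y).

    `font` is a hypothetical resource name; a real page would have to
    define it in its resource dictionary.
    """
    ops = f"BT {font} {size} Tf 1 0 0 1 {x} {y} Tm {pdf_literal(text)} Tj ET"
    if tag:
        # Wrap the text object in a marked-content sequence: /Tag BMC ... EMC
        ops = f"/{tag} BMC {ops} EMC"
    return ops


if __name__ == "__main__":
    print(svg_text("Some text goes here", 10, 10))
    print(pdf_text("Some text goes here", 10, 10))
    print(pdf_text("Some text goes here", 10, 10, tag="H1"))
```

The point of the sketch is only that both formats place a string with the same six-number matrix, and that the PDF tag is a wrapper around otherwise unchanged drawing operators.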