Re: [CODE4LIB] What's the descriptive technical terminology?... pdf image of a page. pdf format used with cut paste.

Jonathan Rochkind Thu, 28 Apr 2011 14:22:15 -0700

Neither vector nor raster information describes the actual embedded _text_ we're talking about, though: the stuff that lets you copy-and-paste _text_ (not images), or search text. PDFs can also have that, and can even know which portions of a raster-displayed image correspond to which characters.

Text characters in a PDF aren't vector images; they're actually character bytes encoded with some encoding such as utf-8.
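
To make the distinction concrete, here is a minimal sketch in Python. It assumes a toy, uncompressed page content stream (the sample bytes, the font name /F1, and the image name /Im1 are made up for illustration). Real PDFs usually compress their streams and use font-specific encodings, so this is not a general extractor; it only shows that embedded text is stored as character strings handed to a text-showing operator (Tj), separately from any image that gets painted (Do).

```python
import re

# Hypothetical, uncompressed content stream for a page that has both
# embedded text and a scanned page image.
TEXT_AND_IMAGE = b"""
BT
/F1 12 Tf
72 720 Td
(Hello, code4lib) Tj
ET
q 612 0 0 792 0 0 cm /Im1 Do Q
"""

# Hypothetical content stream for a page that is only a scanned image.
IMAGE_ONLY = b"q 612 0 0 792 0 0 cm /Im1 Do Q"


def shown_strings(stream: bytes) -> list[str]:
    """Return the simple literal strings passed to the Tj operator.

    Only handles unescaped, non-nested (...) strings; this is a toy parser.
    """
    matches = re.findall(rb"\(([^()\\]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


def has_embedded_text(stream: bytes) -> bool:
    """True if the toy stream shows at least one text string."""
    return bool(shown_strings(stream))


if __name__ == "__main__":
    print(shown_strings(TEXT_AND_IMAGE))
    print(has_embedded_text(IMAGE_ONLY))
```

The first stream is what people in this thread would call a "text-based" or "searchable" page; the second is "image-based", with nothing to select or search.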


On 4/28/2011 5:00 PM, Carl Wiedemann wrote:

I should also remark that vector information and raster information may
exist in the same PDF file. For example, a PDF of a magazine or newspaper
will probably have vector text and column borders, while photography will be
raster at ~300dpi.


On Thu, Apr 28, 2011 at 2:58 PM, Carl Wiedemann <[email protected]> wrote:

Generally PDFs are capable of displaying two types of information: Vector
and Raster.

Vector information describes points, smooth lines, gradients, and curves.
It is lossless and has no native resolution, so it can be scaled
indefinitely. Text data is understood as vector information if we regard
textual documents as images. Generally, composing a document in a word
processor and printing it to a PDF results in the text being stored as
actual vector shapes -- you can zoom in on the text as much as you'd like.
PDF readers understand this information as native text: you can select the
text with a cursor, search the text, and copy/paste. Other formats like SVG
and EPS generally express vector information.

Raster information is composed of pixels. JPEG, PNG, GIF, BMP, TIFF are
examples of raster information. These have a definite resolution, and, from
a computing perspective, are just a bunch of dots. When you scan an image
(or a document), it is digitally translated into a raster. Digital photographs
are raster. There are some techniques using Optical Character Recognition
(OCR) which can actually recognize characters in a raster image and
transform them into text data. There are also procedures to do a "bitmap
trace" to attempt to create vector information from a raster image.
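
As a rough illustration of what "definite resolution" means, here is a small Python sketch. The function names and the US Letter page size are assumptions chosen for the example, not anything from a particular tool. It computes how many pixels a scan contains, and how few pixels per displayed inch remain once a reader zooms in.

```python
def pixel_dimensions(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel size of a page scanned at the given dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(scan_dpi: float, zoom: float) -> float:
    """Pixels available per displayed inch after zooming in by `zoom` times.

    A raster has a fixed number of pixels, so zooming spreads them thinner.
    Vector shapes have no such limit because they are redrawn at any scale.
    """
    return scan_dpi / zoom


if __name__ == "__main__":
    # A US Letter page (8.5 x 11 inches) scanned at 300 dpi.
    print(pixel_dimensions(8.5, 11, 300))
    # The same scan viewed at 400% zoom.
    print(effective_dpi(300, 4))
```

At 400% zoom the 300 dpi scan has only 75 pixels per displayed inch, which is why scanned text looks blocky when enlarged while vector text stays sharp.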

More info here
http://en.wikipedia.org/wiki/Vector_graphics
http://en.wikipedia.org/wiki/Raster_graphics




On Thu, Apr 28, 2011 at 11:10 AM, Van Mil, James (vanmiljf)<
[email protected]>  wrote:

I often employ the word 'raster', along with some other foul language, for
any PDFs that don't have manipulate-able text.

-James

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of
Keith Jenkins
Sent: Thursday, April 28, 2011 1:06 PM
To: [email protected]
Subject: Re: [CODE4LIB] What's the descriptive technical terminology?...
pdf image of a page. pdf format used with cut paste.

I've also heard many people use the term "searchable PDF" for a text-based
PDF.

Keith

On Thu, Apr 28, 2011 at 12:43 PM, Peter Murray <[email protected]>
wrote:

That is the same terminology I use as well -- image-based versus
text-based. I find that works most times because people can visually see if
something looks like a scanned image.
