Re: [NTG-context] ActualText

Barry Schwartz Sat, 19 Sep 2009 12:54:40 -0700

Arthur Reutenauer <[email protected]> skribis:
>   He means "ActualText tags" :-)  See the PDF spec section 14.9.4, page 623.
> It's a more generic way to support searching than ToUnicode vectors: you just
> specify the actual string of underlying Unicode characters.  The PDF spec uses
> hyphenated "ck" in German as an example: you typeset "Druk-ker" but you want to
> search for "Drucker".  You can't do that with ToUnicode vectors.


You also need ActualText tags to mark the difference between a
discretionary hyphen and an explicit hyphen in English, a distinction
that programs like Reader use when extracting text. When the hyphen is
discretionary you set the ActualText to U+00AD (soft hyphen) instead
of U+002D (hyphen-minus). (That's mentioned somewhere in the PDF
spec.)
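
To make the soft-hyphen case concrete, here is a minimal Python sketch
of wrapping a piece of content stream in a /Span marked-content
sequence that carries an ActualText entry. The function name
actual_text_span and the sample "(-) Tj" operator are made up for
illustration; a real PDF writer would also have to deal with fonts,
encodings, and string escaping.

```python
def actual_text_span(replacement: str, content_ops: str) -> str:
    """Wrap content-stream operators in a /Span sequence with ActualText.

    `replacement` is the text a viewer should extract; `content_ops` is
    the already-built operator text that draws the visible glyphs.
    """
    # PDF text strings may be UTF-16BE with a leading byte order mark
    # (FE FF); it is written here as a hexadecimal string.
    hex_text = ("\ufeff" + replacement).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{content_ops}\nEMC"


if __name__ == "__main__":
    # A line-end hyphen that should be extracted as U+00AD, not U+002D.
    print(actual_text_span("\u00ad", "(-) Tj"))
```

The same wrapper would serve for any run where the extracted text
should differ from what the glyphs alone suggest.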

Another thing I just thought of, which isn't always done: there
should be explicit space characters between words, including at the
ends of lines, although I'm not sure whether Adobe Reader turns off
its word-boundary heuristics if it sees space characters.

Since what I enjoy doing is making e-books that can be searched
through and, perhaps more importantly, extracted from via the Select
tool, it's important to me to make the search, selection, and
extraction features work. I'll use them myself if I choose, for
instance, to quote from an e-book I made. I've added them in my
(heavily) modified version of ant, but that's in a primitive state, a
long-term project that competes with font-making and e-book-making for
time, and so I'd like to have ConTeXt as well. I like ConTeXt a lot.

Also, I noticed when playing around with the examples from the "Th"
ligature discussion that searching and extraction didn't work with
small caps, though it did work with the ligature. With ActualText tags
these things always work, regardless of the ToUnicode map's
contents. The way Cairo's PDF backend handles this is to use an
ActualText tag for any glyphs that aren't included in the font's
encoding. What I did in my modified ant is to generate a ToUnicode map
from the Adobe glyph naming convention
(http://www.adobe.com/devnet/opentype/archives/glyph.html) and then
put an ActualText tag on anything that happens not to match what you
would get from the ToUnicode mapping.
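
Roughly, that approach can be sketched in Python as below. This is not
the code from my modified ant, only an illustration under stated
assumptions: AGL_SUBSET is a tiny stand-in for the real Adobe Glyph
List, the function names are invented here, and the glyph names in the
example ("T_h", "a.sc") are hypothetical.

```python
import re

# Tiny illustrative subset of the Adobe Glyph List; the real list is
# far larger.
AGL_SUBSET = {"a": "a", "T": "T", "h": "h", "k": "k",
              "hyphen": "-", "space": " "}


def _valid(cp: int) -> bool:
    # Reject surrogates and values beyond the Unicode range.
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def glyph_name_to_text(name: str) -> str:
    """Map a glyph name to text following the glyph naming convention."""
    base = name.split(".", 1)[0]          # drop suffixes such as ".sc"
    out = []
    for comp in base.split("_"):          # ligature components
        if comp in AGL_SUBSET:
            out.append(AGL_SUBSET[comp])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", comp):
            digits = comp[3:]
            cps = [int(digits[i:i + 4], 16)
                   for i in range(0, len(digits), 4)]
            if all(_valid(cp) for cp in cps):
                out.append("".join(chr(cp) for cp in cps))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", comp):
            cp = int(comp[1:], 16)
            if _valid(cp):
                out.append(chr(cp))
        # Anything else maps to the empty string.
    return "".join(out)


def runs_needing_actual_text(runs):
    """Return the (glyph_names, intended_text) runs that the glyph-name
    mapping alone would get wrong, so they need an ActualText tag."""
    flagged = []
    for glyph_names, intended in runs:
        mapped = "".join(glyph_name_to_text(n) for n in glyph_names)
        if mapped != intended:
            flagged.append((glyph_names, intended))
    return flagged


if __name__ == "__main__":
    runs = [(["T_h"], "Th"), (["k", "hyphen", "k"], "ck")]
    print(runs_needing_actual_text(runs))
```

In this sketch the "Th" ligature needs no tag because its name already
decomposes to the right text, while the "k-k" run from the German
example is flagged.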

(For reasons that were stupid, I once created a lame little C library
to do the mapping from glyph names to Unicode, using a compressed
lookup trie:
http://code.google.com/p/kompostilo/source/browse/#svn/trunk/support-libraries/glyph_name
)
