From: [EMAIL PROTECTED]
Subject: Re: [poppler] On PDF Text Extraction
Date: September 26, 2007 8:29:19 AM EDT
To: [EMAIL PROTECTED]
[Some comments - inline]
On Sep 18, 2007, at 7:08 PM, Behdad Esfahbod wrote:
> Before I started research that led to this thread, I wrote some
> stuff about this, which I now see does not work. Specifically,
> ActualText is not supported in poppler (and possibly other
> extractors) at all, so that cannot be part of a portable
> solution.
However, use of ActualText is a good idea for a variety of other
reasons and is being recommended as part of PDF/A-2's new "Unicode
compliance level" for those cases where ToUnicode doesn't suffice.
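For concreteness, ActualText is attached to a marked-content sequence in the content stream. The sketch below builds such a sequence as a string; the helper name is made up for illustration and is not a poppler or cairo API:

```python
def actual_text_span(shown_glyph_ops, text):
    """Wrap raw content-stream operators in an /ActualText
    marked-content sequence ("replacement text" in the PDF spec)."""
    # ActualText is conventionally a UTF-16BE string with a BOM,
    # written here as a hex string.
    utf16 = "FEFF" + text.encode("utf-16-be").hex().upper()
    return (f"/Span << /ActualText <{utf16}> >> BDC\n"
            f"{shown_glyph_ops}\n"
            "EMC")

# A single ligature glyph shown once, but extracted as "fi":
print(actual_text_span("(\\024) Tj", "fi"))
```

An extractor that understands ActualText replaces whatever ToUnicode would have produced for the bracketed operators with the given string, which is exactly the escape hatch PDF/A-2 recommends when ToUnicode cannot express the mapping.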
> - It's crucial for the above algorithm to work that a ToUnicode
> entry mapping a glyph to an empty string works. That is, a
> glyph that maps to zero Unicode characters.
I will verify this, but I am pretty sure that this is invalid. You
MUST have at least one character on the right side of the mapping.
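On the extraction side, a ToUnicode table is essentially a code-to-string lookup. A toy sketch (not poppler's implementation) showing the one-to-many case that is clearly legal, as opposed to the zero-character mapping Leonard believes the spec disallows:

```python
# Toy ToUnicode table: glyph code -> Unicode string.  One-to-many
# mappings (e.g. a ligature glyph) are legal; an empty string on the
# right side is what Leonard says the spec forbids.
to_unicode = {
    0x24: "fi",   # fi ligature glyph -> two characters
    0x41: "A",
}

def extract(codes, table):
    # Fall back to U+FFFD for unmapped codes, as many extractors do.
    return "".join(table.get(c, "\ufffd") for c in codes)

print(extract([0x41, 0x24], to_unicode))   # -> "Afi"
```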
> - Every font may need to have an "empty" glyph, that is most
> useful if zero-width. This is to be able to include things
> like U+200C ZERO WIDTH NON-JOINER in the extracted text.
I don't understand this. What is the point of having "empty text or
glyphs" in the PDF? It's simply not necessary.
And if you insist on doing this, do NOT use .notdef
> The main problem is that PDF doesn't have an easy way to
> convey bidi information. There is a ReversedText property
> but it belongs to Tagged PDF part of the spec which is far
> from supported.
There is also the WritingMode tag, which is important not just for
RTL but also vertical text. This is an EXTREMELY important tag used
when dealing with mixed direction text - or a page consisting of
various "block level" elements with varying writing modes.
You could also use the Lang entry, but that's not really about bidi...
> - A problem about using composite fonts is that when you find
> out that you need a composite font (that is, more than 255
> glyphs of the font should be subsetted), it's too late to
> switch, since you have already output PDF code for the previous
> glyphs as single-byte codepoints. So one ends up using
> composite fonts unconditionally (exception is, if the
> original font has less than 256 glyphs, there's no point in
> using composite fonts at all). This slightly wastes space
> as each codepoint will be encoded as four hex bytes instead
> of two.
You could also do some pre-processing of the text, prior to
rendering, to determine the complete glyph/code-point complement
necessary and then make decisions.
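That pre-pass could be as simple as collecting the distinct glyphs per font before any page content is emitted, then choosing the subset format. A sketch with made-up helper names, assuming the 255-glyph limit discussed above (code 0 being reserved for .notdef):

```python
def choose_font_format(glyph_ids):
    """Decide the subset format from the full glyph complement.
    Assumption: up to 255 distinct glyphs fit a simple (1-byte)
    font; beyond that a composite (CID) font with 2-byte codes
    is needed."""
    distinct = set(glyph_ids)
    return "simple" if len(distinct) <= 255 else "composite"

print(choose_font_format(range(100)))    # -> simple
print(choose_font_format(range(1000)))   # -> composite
```

The cost is buffering (or walking the input twice), which is exactly the trade-off against the stream-as-you-go approach discussed further down.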
> However, one can use a UTF-8 scheme such that the
> first 128 glyphs are accessed using one byte and so on. This
> way the PDF generator can use a simple font for subsets with
> less than 128 glyphs. However, Adrian is telling me that
> Poppler only supports UCS-2 Identity CMap mapping for CID
> font codepoint encoding. So this may not be feasible.
The only way to do this would be to actually use TWO separate font
subsets - one that was single-byte (for code points < 128) and one
that was double-byte. This is because the CMap, ToUnicode tables, etc.
all expect (according to the PDF Reference) that you only use a
single encoding for the entire font - so it's either one byte or two.
> - Shall we use standard encodings if all the used glyphs in a
> subset are in a well-supported standard encoding? May be
> worth the slight optimization. Also may make generated
> PS/PDF more readable for the case of simple ASCII text.
I would definitely do this! Makes for smaller PDFs for "Roman-
only" documents.
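The check is cheap. A sketch, with the simplifying assumption that WinAnsiEncoding is approximated by the Latin-1 printable range (a real generator would consult the actual encoding table):

```python
def fits_standard_encoding(chars):
    """True if every character can be written with a 1-byte
    standard encoding.  Approximation: WinAnsiEncoding is close
    to Latin-1; this is a sketch, not the real encoding table."""
    return all(0x20 <= ord(c) <= 0xFF for c in chars)

print(fits_standard_encoding("Hello, world"))  # -> True
print(fits_standard_encoding("naïve"))         # -> True (Latin-1)
print(fits_standard_encoding("日本語"))         # -> False
```

When the check passes, text operators can show plain literal strings, which is what makes the generated PDF readable for ASCII-only documents.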
> - Also occurred to me that in PDF almost all objects can come
> after they are referenced. Does this mean we can write out
> pages as we go and avoid writing to a temp file that we
> currently do?
Of course!! Most PDF generators do this - no temp files required.
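Streaming works because the cross-reference table, which records each object's byte offset, is itself written last. A minimal sketch of the bookkeeping (toy objects, not a complete writer):

```python
import io

def write_pdf_streaming(objects):
    """Write body objects as they arrive, remembering byte offsets,
    then emit the xref table and trailer at the end."""
    buf = io.BytesIO()
    buf.write(b"%PDF-1.4\n")
    offsets = []
    for i, body in enumerate(objects, start=1):
        offsets.append(buf.tell())          # offset known only now
        buf.write(f"{i} 0 obj\n".encode())
        buf.write(body)
        buf.write(b"\nendobj\n")
    xref_pos = buf.tell()
    buf.write(f"xref\n0 {len(objects) + 1}\n".encode())
    buf.write(b"0000000000 65535 f \n")     # free-list head entry
    for off in offsets:
        buf.write(f"{off:010d} 00000 n \n".encode())
    buf.write(f"trailer << /Size {len(objects) + 1} >>\n"
              f"startxref\n{xref_pos}\n%%EOF\n".encode())
    return buf.getvalue()

pdf = write_pdf_streaming([b"<< /Type /Catalog >>"])
```

Forward references work the same way: emit `3 0 R` now, record object 3's offset whenever it is finally written, and the xref at the end ties it all together.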
> - Some cairo API may be added to allow TaggedPDF
> marking from higher level. Something like:
> cairo_pdf_marked_content_sequence_t
> cairo_pdf_surface_begin/end_marked_content()
Cairo may wish to support this for other reasons, since marked
content is used in PDF for a variety of features including optional
content (aka Layers), object properties and more.
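Whatever the eventual cairo API looks like (the names quoted above are only a proposal), the operators such a begin/end pair would emit are simple. A sketch of the content-stream side:

```python
def begin_marked_content(tag, properties=None):
    """Open a marked-content sequence.  With a property list the
    operator is BDC, without one it is BMC."""
    if properties is None:
        return f"/{tag} BMC"
    return f"/{tag} {properties} BDC"

def end_marked_content():
    return "EMC"

# Optional-content group, referenced via a named Properties resource:
print(begin_marked_content("OC", "/MC0"))  # -> "/OC /MC0 BDC"
print(end_marked_content())                # -> "EMC"
```

The same bracketing carries Tagged PDF structure, ActualText, and optional content, which is why a single begin/end API in cairo would serve several features at once.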
> Anyway, wow, 400 lines. Thanks for reading this far. I'm going
> to do a presentation on PDF text extraction at the Linux
> Foundation OpenPrinting Summit next week in Montréal, mostly based on
> this mail:
If you'd like it reviewed - feel free to send me a copy...
Leonard Rosenthol
PDF Standards Evangelist
Adobe Systems
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler