Hi, Just joined the Poppler list. Excuse my ignorance if I don't know it around here.
Anyway, I wrote about PDF text extraction from the point of view of what cairo should be doing to generate perfectly text-extractable PDFs. Forwarding the message here as people may be interested. I also point out a few poppler bugs. I plan to fix them at some point, but it may be an obvious small fix to those familiar with the code base. Regards, behdad -------- Forwarded Message -------- From: Behdad Esfahbod <[EMAIL PROTECTED]> To: cairo <[EMAIL PROTECTED]> Subject: PDF Text Extraction: Future Date: Sun, 16 Sep 2007 19:37:53 -0400 Previously I wrote about the state of text extraction in PDF generated by cairo. This is a follow-up message and uses definitions from the first one. So, please review it before reading on: http://lists.freedesktop.org/archives/cairo/2007-February/009452.html Warning: long long email. Refill your cup of coffee now. ================================================================= Text Extraction, Future: As already mentioned, perfect text extraction is not possible when only the output glyphs are given to the PDF generator. So we need a new API in cairo for that. This new API needs to take both input text and output glyphs, and a mapping between the two. To get into the mapping, we need new terminology: - Cluster: A minimal pair of matching input text and output glyphs. Being minimal, a cluster cannot be broken into two clusters semantically. Most of the time clusters can be grouped as one of the following kinds, though this does not hold necessarily: * Grapheme clusters. A grapheme is a piece of input text / output glyphs that expresses one logical unit of text as identified by the language of the text in question. For example, "ö" is a grapheme. The input text representing it may be U+00F6 LATIN SMALL LETTER O WITH DIAERESIS, or <U+006F LATIN SMALL LETTER O,U+0308 COMBINING DIAERESIS>, and the output glyphs may be a single glyph, or two (one for "o", another for the diaeresis for example). In Grapheme clusters, the input text is associated with the output glyphs as a whole, and it is not desirable to extract part of the input text if only part of the output glyphs are selected. * Ligature clusters. A ligature cluster includes input text for more than one grapheme, but the graphemes interact (typographically or orthographically) such that the output glyphs cannot be divided further for individual graphemes. An example of a ligature cluster that is purely typographical is the "ff" ligature. It maps two input graphemes <"f","f"> to a single glyph "ff". An orthographical ligature cluster is the Arabic Lam-Alef ligature that maps two input graphemes <"ل","ا"> to something that looks like "ﻻ". This is a mandatory ligature in Unicode and its shape is different than that of Lam followed by Alef would otherwise look. A ligature cluster most of the time has only one output glyph, but it doesn't have to. Since ligature clusters contain more than one grapheme, it is perfectly valid for the user to want to select only a subset of the graphemes. There is no general solution to this problem, so most implementations simply divide the width of an output glyph into the number of graphemes linearly. So for example if you mouse over to the middle of the "ff" ligature, only one "f" will be selected... This is far from perfect for some Indic languages, but is good enough. To sum up, a cluster is an indivisible mapping of M input characters to N output glyphs. We show this as M->N. Most clusters are 1->1, but 1->N, M->1, and M->N are all commonly found in more complex text. There are also special cases where one of M or N is zero. A M->0 cluster may be an ASCII tab character, or a U+200C ZERO WIDTH NON-JOINER. We will find a use for 0->N later :-). And a cluster mapping: - Cluster Mapping: A sequence of clusters, each encoded as the number of characters and the number of glyphs that it consumes. The mapping also may have a "backward" flag set on it, in which case the output glyphs will be consumed from the end of the array to the beginning. This is needed in right-to-left runs of text for example. So, our new cairo API takes input text, output glyphs, and cluster mapping. No _extents or _path variants are needed. I have prototyped this call in the show_text_glyphs branch in my tree. Here is the prototype: typedef struct { unsigned char num_bytes; unsigned char num_glyphs; } cairo_text_cluster_t; cairo_public void cairo_show_text_glyphs (cairo_t *cr, const char *utf8, int utf8_len, const cairo_glyph_t *glyphs, int num_glyphs, const cairo_text_cluster_t *clusters, int num_clusters, cairo_bool_t backward); Yes, show_text_glyphs is not the best name in the world, but is much better than show_clusters, show_text_clusters, etc... The implementation tries the show_text_glyphs method of the target surface and if that doesn't work, falls back to show_glyphs. It is important to get the API right for 1.6, such that Pango can move to using it. The actual PDF implementation can be fixed gradually and later. Note that PDF is not the only backend that can make good use of this call. SVG can do similar stuff, and may have an option to throw the glyphs away completely and just embed the text. ================================================================= PDF Implementation: Before I started research that led to this thread, I wrote some stuff about this, which I now see does not work. Specifically, ActualText is not supported in poppler (and possibly other extractors) at all, so that cannot be part of a portable solution. Here is the old post for reference: http://lists.freedesktop.org/archives/cairo/2007-January/009127.html I now have a solid plan that requires changes in cairo, pango, and poppler, but all the changes are gradual and small, and it should work with other viewers/extractors reasonably well too, but I have not tested them yet. The basic idea is that instead of allocating one code-point per unique glyph, we now will allocate one code-point per unique M->1 clusters. The reason is that, as I summarized in the first mail, each code point maps to exactly one glyph and a string of text, using the ToUnicode mapping. This seemingly obvious idea solves a lot of problems. First, we don't have to care about what the default character for a glyph is. We just don't care! So here is a first pass on the algorithm: for each cluster: - if it's M->0, add the index of an "empty" glyph to the cluster output glyphs and act as if it's M->1. - if it's M->1, search for this cluster in current code-points, and allocate a new code-point if not found, mapping to the single glyph involved and the input text (using ToUnicode). - otherwise it's M->N where N > 1. Break the cluster into N pseudo-clusters, each having exactly one glyph, and 0 or more input characters. The break can be done linearly, or taking glyph advances or positions into account. For each resulting pseudo-cluster, act as if it's a M->1 cluster. Discussion: - It's crucial for the above algorithm to work that a ToUnicode entry mapping a glyph to an empty string works. That is, a glyph that maps to zero Unicode characters. Nowhere in the CMap spec I found whether this is allowed or not. I don't remember if Acrobat handles it correctly, but poppler has some bugs: when you select such a character, it's not rendered correctly and it outputs a U+FFFD in the extracted text instead of nothing. A nasty hack to relax this is to output an "ignorable" Unicode character instead. Something like U+2060 WORD JOINER, or if that affects Indic shaping, a nastier one like U+2063 INVISIBLE SEPARATOR, or something equally useless and ignorable (general category Cf). Another way around it if many viewers don't support it is to add new composite glyphs to the font that combine multiple glyphs into one, so we don't have to use 0->N clusters. Some font formats may not support composite glyphs however. - Every font may need to have an "empty" glyph, that is most useful if zero-width. This is to be able to include things like U+200C ZERO WIDTH NON-JOINER in the extracted text. As these normally don't map to any glyphs, and every code-point maps to exactly one glyph, we need a dummy glyph to be able to preserve such invisible character. This also includes things like ASCII tab character. It should be trivial to add such a glyph in all font formats. - When selecting a M->1 cluster, poppler divides a glyph's width linearly into M parts, one for each character. This is similar to what Pango does, except that Pango only allows selection at grapheme boundaries, not each character boundary. There does not seem to be any way to convey this information in PDF. But viewers can be overcome this rather easily: the grapheme boundary (aka cursor-position) information is font independent and only depends on the input text (plus Unicode Character Database). So, Poppler (and other viewers) can be made to call into Pango or similar systems to find out grapheme boundaries and allow selection only at those points. - Backward: the backward mapping flag is needed as part of the cluster mapping, but it may as well be used to ensure that extracted bidi text behaves the same way as it was intended. The main problem is that PDF doesn't have an easy way to convey bidi information. There is a ReversedText property but it belongs to Tagged PDF part of the spec which is far from supported. We will assume that glyphs in a PDF are always layed out in the visual order, that is, from left to right. There is nothing preventing a library generating glyphs that have a negative advance width and so go in the logical order for right-to-left text, but it's not common practice and most probably not very well supported. So we'll stay with LTR glyphs for now. The problem then is, how a viewer like Poppler should do the reverse-bidi step to get to the logical text, from the visual text extracted from the glyphs. Poppler currently does a heuristic based on the Bidi direction of characters extracted, and while it does a good job at it, lets just say that it doesn't do it very correctly. What it does is, it finds maximal runs of RTL chars, reverses the text extracted for that run, and sandwich it between a U+202B RIGHT-TO-LEFT EMBEDDING and U+202C POP DIRECTIONAL FORMATTING pair. Obvious problems with this are: o The hard thing about reverse-bidi is how to handle "neutral" characters: those without a strong LTR or RTL direction. Poppler doesn't try to group them with adjacent RTL runs. o With a run of all-RTL chars, there's absolutely no need to put RLE/PDF around it. o Instead of reversing the glyphs and then extracting text from them and append, it extracts text first and then reverse. So if a glyph maps to two or more characters, those come out backward in the extracted text, which is wrong. We can do better by making some assumptions about the generated PDF, and working a bit harder. The assumption to make about the PDF is that each text show operation contains a run of text with exactly one direction. That is, assume that the text layout engine broke text into runs based on direction among other things. This holds true for virtually all text layout engines I know. Another way to improve the situation is to break text runs into hopefully less ambiguous ones in Pango, to make the job of the extractor easier. This can be used for example if both LTR and RTL characters are available in a text run. Then the text extractor does this: - For every line, extract text from all segments, mark each segment as LTR or RTL or neutral based o the type of chars found into it. - Group adjacent runs of the same type together. - Resolve neutral groups to neighboring group if both are of the same type, otherwise resolve to "base direction" of the line (to be devised). - Do some special-casing for digits, etc. This is really about writing a sophisticated reverse-bidi algorithm, something that I've wanted to do for such a long time and I may as well tackle it this time. Lacking that, I have observed before that, a good approximation to a reverse-bidi algorithm is the bidi algorithm itself. So one may feel lucky and use pango_log2vis_get_embedding_levels() directly (or from FriBidi). It's mostly about getting the details right, and I can't say more before getting down to code. One complication is that the text stream may have bidi override marks itself as we now can and want to embed them in the text stream. - Other poppler issues I faced: o When a glyph maps to more than one character, when part of the glyph is selected, it renders the glyph multiple times. Should be obvious to fix. o When a glyph maps to zero chars, it doesn't render the glyph in selection and copies a space character instead. More important is to see how Adobe Reader handles it. o Has serious problems copying combining marks positioned using the Td operation. o It does a hefty job at column detection which is good, but has serious problems. What I understand from reading the code is that it sorts all the glyphs in a visual order (top to bottom, left to right), and processes them in that order. It should instead work on a text-operation level for the least. My guess is that simply processing text operations in the running order in the PDF file produces much better results, but then again there probably have been good reason to write the current code in the first place. - Thinking about implementation, ToUnicode tables are independent of font formats, so one piece of code can generate them nicely, which is mostly the case already. But the codepoint-to-glyph mapping lives in different places for different font types. So the code to handle it is spread over both cairo-*-subset.c files and cairo-*-surface.c files. But given all the PDF expertise we have thanks to Adrian's awesome work, I wouldn't worry much about implementation :-). - Pango also needs a new method in PangoRenderer called draw_item which is the Pango equivalent of show_text_glyphs. That's easy. A harder issue is to make Pango render runs in a line in the logical order as opposed to the current visual order. That's a bit harder because of the way PangoLayoutLine struct has been made public and can't be changed. With runs rendered in logical order, viewers that simply follow PDF order work best, and those sorting runs/glyphs based on their position will not work worse than they currently do. ================================================================= Other Issues: Some other random thought about PDF text output that is not necessarily related to the text extraction problem, but passed through my mind during these experiments: - A problem about using composite fonts is that when you find out that you need a composite font (that is, more than 255 glyphs of the font should be subsetted), it's too late to switch, since you have already output PDF code the previous glyphs as single-byte codepoints. So one ends up using composite fonts unconditionally (exception is, if the original font has less than 256 glyphs, there's no point in using composite fonts at all.). This slightly wastes space as each codepoint will be encoded as four hex bytes instead of two. This is assuming the simple UCS-2 Identity CMap codepoint encoding that maps two bytes to a glyph value of 0..65535. However, one can use a UTF-8 scheme such that the first 128 glyphs are accessed using one byte and so on. This way the PDF generator can use a simple font for subsets with less than 128 glyphs. However, Adrian is telling me that Poppler only supports UCS-2 Identity CMap mapping for CID font codepoint encoding. So this may not be feasible. It does support arbitrary CMap mappings for ToUnicode tables though, so the code certainly is there... The PDF spec also says that: Acrobat Versions 4.0 and 5.0 (PDF Versions 1.3 and 1.4, respectively) must use “ToUnicode” mapping files that are restricted to UCS-2 (Big Endian) encoding, which is equivalent to UTF-16BE encoding without Surrogates. (Note how the "fi" in "files" is encoded as a U+FB01 LATIN SMALL LIGATURE FI because I copy/pasted from the spec.) - Shall we use standard encodings if all the used glyphs in a subset are in a well-supported standard encoding? May be worth the slight optimization. Also may make generated PS/PDF more readable for the case of simple ASCII text. - Also occurred to me that in PDF almost all objects can come after hey are referenced. Does this mean we can write out pages as we go and avoid writing to a temp file that we currently do? - TaggedPDF allows for a lot more, but it's very hard to for example mark all paragraphs and pages and all. There's also a ReversedText tag in there. In most cases though, seems like we can do the same without it as far as text extraction is concerned. Some cairo API may be added to allow TaggedPDF marking from higher level. Something like: cairo_pdf_marked_content_sequence_t cairo_pdf_surface_begin/end_marked_content() - Other possibly useful PDF APIs will be for embedding arbitrary objects, for marking ActualText around arbitrary drawing operations, and to get object number for any embedded fonts, patterns, etc. We are still waiting for someone more familiar with PDF and higher-level use-cases to tell us what exactly they need from cairo... ================================================================= Anyway, wow, 400 lines. Thanks for reading this far. I'm going to do a presentation on PDF text extraction at the Linux Foundation OpenPrinting Summit next week in Montréal, mostly based on this mail: http://www.linux-foundation.org/en/OpenPrinting/SummitMontreal I try to hack something up this week, but most probably I just get to do the Pango side and wait for some PDF ninja to do the cairo side. You know who you are! Cheers, -- behdad http://behdad.org/ "Those who would give up Essential Liberty to purchase a little Temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin, 1759 _______________________________________________ poppler mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/poppler
