From: [EMAIL PROTECTED]
Subject: Re: [poppler] On PDF Text Extraction
Date: September 26, 2007 8:29:19 AM EDT
To: [EMAIL PROTECTED]
[Some comments - inline]
On Sep 18, 2007, at 7:08 PM, Behdad Esfahbod wrote:
> Before I started research that led to this thread, I wrote some
> stuff about this, which I now see does not work. Specifically,
> ActualText is not supported in poppler (and possibly other
> extractors) at all, so that cannot be part of a portable
> solution.
However, use of ActualText is a good idea for a variety of other
reasons and is being recommended as part of PDF/A-2's new "Unicode
compliance level" for those cases where ToUnicode doesn't suffice.
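For concreteness, ActualText is attached to a marked-content sequence in the content stream. The sketch below builds such a sequence as a string; the helper name is made up for illustration and is not a poppler or cairo API:

```python
def actual_text_span(shown_glyph_ops, text):
    """Wrap raw content-stream operators in an /ActualText
    marked-content sequence ("replacement text" in the PDF spec)."""
    # ActualText is conventionally a UTF-16BE string with a BOM,
    # written here as a hex string.
    utf16 = "FEFF" + text.encode("utf-16-be").hex().upper()
    return (f"/Span << /ActualText <{utf16}> >> BDC\n"
            f"{shown_glyph_ops}\n"
            "EMC")

# A single ligature glyph shown once, but extracted as "fi":
print(actual_text_span("(\\024) Tj", "fi"))
```

An extractor that understands ActualText replaces whatever ToUnicode would have produced for the bracketed operators with the given string, which is exactly the escape hatch PDF/A-2 recommends when ToUnicode cannot express the mapping.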
> - It's crucial for the above algorithm to work that a ToUnicode
> entry mapping a glyph to an empty string works. That is, a
> glyph that maps to zero Unicode characters.
I will verify this, but I am pretty sure that this is invalid. You
MUST have at least one character on the right side of the mapping.
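On the extraction side, a ToUnicode table is essentially a code-to-string lookup. A toy sketch (not poppler's implementation) showing the one-to-many case that is clearly legal, as opposed to the zero-character mapping Leonard believes the spec disallows:

```python
# Toy ToUnicode table: glyph code -> Unicode string.  One-to-many
# mappings (e.g. a ligature glyph) are legal; an empty string on the
# right side is what Leonard says the spec forbids.
to_unicode = {
    0x24: "fi",   # fi ligature glyph -> two characters
    0x41: "A",
}

def extract(codes, table):
    # Fall back to U+FFFD for unmapped codes, as many extractors do.
    return "".join(table.get(c, "\ufffd") for c in codes)

print(extract([0x41, 0x24], to_unicode))   # -> "Afi"
```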
> - Every font may need to have an "empty" glyph, that is most
> useful if zero-width. This is to be able to include things
> like U+200C ZERO WIDTH NON-JOINER in the extracted text.
I don't understand this. What is the point of having "empty text or
glyphs" in the PDF? It's simply not necessary.
And if you insist on doing this, do NOT use .notdef
> The main problem is that PDF doesn't have an easy way to
> convey bidi information. There is a ReversedText property
> but it belongs to Tagged PDF part of the spec which is far
> from supported.
There is also the WritingMode tag, which is important not just for
RTL but also vertical text. This is an EXTREMELY important tag used
when dealing with mixed direction text - or a page consisting of
various "block level" elements with varying writing modes.
You could also use the Lang entry, but that's not really about bidi...
> - A problem about using composite fonts is that when you find
> out that you need a composite font (that is, more than 255
> glyphs of the font should be subsetted), it's too late to
> switch, since you have already output PDF code for the previous
> glyphs as single-byte codepoints. So one ends up using
> composite fonts unconditionally (exception is, if the
> original font has less than 256 glyphs, there's no point in
> using composite fonts at all). This slightly wastes space
> as each codepoint will be encoded as four hex bytes instead
> of two.
You could also do some pre-processing of the text, prior to
rendering, to determine the complete glyph/code-point complement
necessary and then make decisions.
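That pre-pass could be as simple as collecting the distinct glyphs per font before any page content is emitted, then choosing the subset format. A sketch with made-up helper names, assuming the 255-glyph limit discussed above (code 0 being reserved for .notdef):

```python
def choose_font_format(glyph_ids):
    """Decide the subset format from the full glyph complement.
    Assumption: up to 255 distinct glyphs fit a simple (1-byte)
    font; beyond that a composite (CID) font with 2-byte codes
    is needed."""
    distinct = set(glyph_ids)
    return "simple" if len(distinct) <= 255 else "composite"

print(choose_font_format(range(100)))    # -> simple
print(choose_font_format(range(1000)))   # -> composite
```

The cost is buffering (or walking the input twice), which is exactly the trade-off against the stream-as-you-go approach discussed further down.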
> However, one can use a UTF-8 scheme such that the
> first 128 glyphs are accessed using one byte and so on. This
> way the PDF generator can use a simple font for subsets with
> less than 128 glyphs. However, Adrian is telling me that
> Poppler only supports UCS-2 Identity CMap mapping for CID
> font codepoint encoding. So this may not be feasible.
The only way to do this would be to actually use TWO separate font
subsets - one that was single-byte (for code points < 128) and one
that was double-byte. This is because the CMap, ToUnicode tables, etc.
all expect (according to the PDF Reference) that you only use a
single encoding for the entire font - so it's either one byte or two.
> - Shall we use standard encodings if all the used glyphs in a
> subset are in a well-supported standard encoding? May be
> worth the slight optimization. Also may make generated
> PS/PDF more readable for the case of simple ASCII text.
I would definitely do this! Makes for smaller PDFs for "Roman-
only" documents.
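The check is cheap. A sketch, with the simplifying assumption that WinAnsiEncoding is approximated by the Latin-1 printable range (a real generator would consult the actual encoding table):

```python
def fits_standard_encoding(chars):
    """True if every character can be written with a 1-byte
    standard encoding.  Approximation: WinAnsiEncoding is close
    to Latin-1; this is a sketch, not the real encoding table."""
    return all(0x20 <= ord(c) <= 0xFF for c in chars)

print(fits_standard_encoding("Hello, world"))  # -> True
print(fits_standard_encoding("naïve"))         # -> True (Latin-1)
print(fits_standard_encoding("日本語"))         # -> False
```

When the check passes, text operators can show plain literal strings, which is what makes the generated PDF readable for ASCII-only documents.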
> - Also occurred to me that in PDF almost all objects can come
> after they are referenced. Does this mean we can write out
> pages as we go and avoid writing to a temp file that we
> currently do?
Of course!! Most PDF generators do this - no temp files required.
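Streaming works because the cross-reference table, which records each object's byte offset, is itself written last. A minimal sketch of the bookkeeping (toy objects, not a complete writer):

```python
import io

def write_pdf_streaming(objects):
    """Write body objects as they arrive, remembering byte offsets,
    then emit the xref table and trailer at the end."""
    buf = io.BytesIO()
    buf.write(b"%PDF-1.4\n")
    offsets = []
    for i, body in enumerate(objects, start=1):
        offsets.append(buf.tell())          # offset known only now
        buf.write(f"{i} 0 obj\n".encode())
        buf.write(body)
        buf.write(b"\nendobj\n")
    xref_pos = buf.tell()
    buf.write(f"xref\n0 {len(objects) + 1}\n".encode())
    buf.write(b"0000000000 65535 f \n")     # free-list head entry
    for off in offsets:
        buf.write(f"{off:010d} 00000 n \n".encode())
    buf.write(f"trailer << /Size {len(objects) + 1} >>\n"
              f"startxref\n{xref_pos}\n%%EOF\n".encode())
    return buf.getvalue()

pdf = write_pdf_streaming([b"<< /Type /Catalog >>"])
```

Forward references work the same way: emit `3 0 R` now, record object 3's offset whenever it is finally written, and the xref at the end ties it all together.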
> - Some cairo API may be added to allow TaggedPDF
> marking from higher level. Something like:
> cairo_pdf_marked_content_sequence_t
> cairo_pdf_surface_begin/end_marked_content()
Cairo may wish to support this for other reasons, since marked
content is used in PDF for a variety of features including optional
content (aka Layers), object properties and more.
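Whatever the eventual cairo API looks like (the names quoted above are only a proposal), the operators such a begin/end pair would emit are simple. A sketch of the content-stream side:

```python
def begin_marked_content(tag, properties=None):
    """Open a marked-content sequence.  With a property list the
    operator is BDC, without one it is BMC."""
    if properties is None:
        return f"/{tag} BMC"
    return f"/{tag} {properties} BDC"

def end_marked_content():
    return "EMC"

# Optional-content group, referenced via a named Properties resource:
print(begin_marked_content("OC", "/MC0"))  # -> "/OC /MC0 BDC"
print(end_marked_content())                # -> "EMC"
```

The same bracketing carries Tagged PDF structure, ActualText, and optional content, which is why a single begin/end API in cairo would serve several features at once.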
> Anyway, wow, 400 lines. Thanks for reading this far. I'm going
> to do a presentation on PDF text extraction at the Linux
> Foundation OpenPrinting Summit next week in Montréal, mostly based on
> this mail:
If you'd like it reviewed - feel free to send me a copy...
Leonard Rosenthol
PDF Standards Evangelist
Adobe Systems
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler