https://bugs.documentfoundation.org/show_bug.cgi?id=119606

خالد حسني <kha...@aliftype.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kha...@aliftype.com
                 OS|Linux (All)                 |All
           Hardware|x86-64 (AMD64)              |All

--- Comment #8 from خالد حسني <kha...@aliftype.com> ---
This depends heavily on the font and the PDF viewer used, and on the
limitations of the PDF format.

We are doing our best with what the PDF format gives us: we output a
ToUnicode mapping when applicable and ActualText tagging when not. We try to
limit the scope of ActualText spans so that individual characters and words can
still be selected and highlighted. We could instead tag full paragraphs with
ActualText, which would preserve the textual content with the most fidelity,
but then PDF viewers would treat the paragraph text as a black box and could no
longer associate the text with the rendered glyphs (so search results can’t be
highlighted, parts of the paragraph can’t be selected, and so on).
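
To make the ActualText trade-off concrete, here is a minimal Python sketch of
what a tagged span looks like in a PDF content stream. It is only an
illustration, not the code LibreOffice uses: the function name and the glyph
operator string "<0123> Tj" (a made-up glyph ID) are assumptions. The
/Span ... BDC ... EMC marked-content syntax and the UTF-16BE text string with
a leading FEFF come from the PDF format itself.

```python
def actual_text_span(text: str, glyph_ops: str) -> str:
    """Wrap glyph-drawing operators in a marked-content span carrying ActualText.

    Hypothetical helper for illustration only.
    """
    # PDF text strings can be UTF-16BE with a leading byte order mark (FEFF),
    # written here as a hex string between angle brackets.
    hex_text = "FEFF" + text.encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{glyph_ops}\nEMC"


# One ligature glyph (made-up ID 0123) that stands for the two characters "fi".
print(actual_text_span("fi", "<0123> Tj"))
```

The wider the span, the more glyphs sit between BDC and EMC, and the less a
viewer can tell which glyph belongs to which character of the replacement
text.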

PDF is not an archival format, no matter how hard Adobe wants to sell this
idea. It is first and foremost a print format, glorified paper so to speak.

We are crippled by several issues here:

* Text in PDF is output in visual order (i.e. from left to right), while the
text content is stored in logical order (the first character comes first in
memory, regardless of the direction). This means any tool extracting text from
PDF needs to convert the visual order back to logical order, and this process
is lossy and not always reliable.

* PDF stores glyphs, not characters, so we need to handle all the complex
glyph-to-character relationships. That is why the result depends on the font.

* Not all PDF viewers support ActualText tagging, and the ToUnicode mechanism
can’t capture all the possible relations above.

* PDF viewers will often try to guess where the spaces are, since many
PDF-producing tools don’t output space characters at all (they just position
the glyphs so that they are separated visually by blank space), so sometimes
kerning can be misinterpreted as word spaces.
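
Two of the issues above can be sketched in a few lines of Python. This is a
toy model, not what any particular PDF viewer does: the function names, the
glyph positions and the gap threshold are all assumptions, and uppercase Latin
letters stand in for right-to-left letters.

```python
def naive_visual_to_logical(visual: str) -> str:
    """Recover logical order by reversing the whole visual run.

    This is only right for purely right-to-left text; it is a deliberately
    naive stand-in for what an extraction tool has to guess.
    """
    return visual[::-1]


def guess_words(glyphs, gap_threshold=0.25):
    """Rebuild text from positioned glyphs, guessing where the spaces are.

    glyphs is a list of (char, x_start, x_end) tuples in left-to-right order.
    A space is inserted wherever the horizontal gap exceeds gap_threshold.
    """
    out = []
    prev_end = None
    for ch, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > gap_threshold:
            out.append(" ")
        out.append(ch)
        prev_end = x_end
    return "".join(out)


# Logical text "SHALOM 123" in a right-to-left paragraph is drawn as
# "123 MOLAHS": the letters run right to left, but the digits still run
# left to right. Plain reversal restores the word and scrambles the number.
print(naive_visual_to_logical("123 MOLAHS"))  # SHALOM 321

# The same four glyphs, read with two different thresholds.
glyphs = [("t", 0.0, 0.3), ("o", 0.32, 0.8), ("b", 1.2, 1.7), ("e", 1.72, 2.2)]
print(guess_words(glyphs))                      # to be
print(guess_words(glyphs, gap_threshold=0.01))  # t o b e
```

With the lower threshold, the small gaps between letters are read as word
spaces, which is the same kind of mistake as taking kerning for a space.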

Overall, I don’t think there is anything that can be done here, but if someone
can attach a PDF that is doing better, I can have a look and see if we can
learn some tricks from it.

Lastly, none of this is platform dependent. If you are getting different
results on different platforms, it will be because of either the different
fonts or the different PDF viewers used.

