[Bug 59870] FILEOPEN PDF: Incorrect text encoding

bugzilla-daemon Fri, 11 Apr 2025 11:13:50 -0700

https://bugs.documentfoundation.org/show_bug.cgi?id=59870


--- Comment #26 from Eyal Rozenberg <[email protected]> ---
(In reply to Khaled Hosny from comment #25)
> The PDF metadata shows that it was produced by Ghostscript. The PDF font
> dictionaries contain no ToUnicode CMaps, nor do they use any standard PDF
> font encoding. As such there is not much textual data that can be extracted
> from the PDF.

But the text is _there_... I'm no PDF expert (nor do I even have a decent tool
for exploring PDF files' raw structure), but if the encoding is ISO-8859-1, or
something similar, should we not be able to figure this out? Especially given
that the hint is a lack of CMaps, rather than garbled CMaps?
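
To make the question concrete, here is a minimal sketch of the kind of fallback being asked about. It is not LibreOffice's importer code. It assumes the raw character codes from a text-showing operator are already available as bytes, and that each code is a single byte. The names `guess_text` and `looks_like_text` are hypothetical. If the font's codes are arbitrary rather than Latin-1-like, the guess will simply produce nonsense, which the heuristic only partly detects.

```python
def guess_text(codes: bytes, encoding: str = "iso-8859-1") -> str:
    """Decode raw single-byte character codes using a guessed encoding.

    ISO-8859-1 maps every byte value to a code point, so this never raises;
    whether the result is meaningful is a separate question.
    """
    return codes.decode(encoding)


def looks_like_text(s: str, threshold: float = 0.9) -> bool:
    """Crude plausibility check: the share of printable characters.

    The threshold is an arbitrary assumption, not a tuned value.
    """
    if not s:
        return False
    printable = sum(ch.isprintable() or ch in "\n\t" for ch in s)
    return printable / len(s) >= threshold


if __name__ == "__main__":
    raw = b"caf\xe9"  # hypothetical codes taken from a content stream
    text = guess_text(raw)
    print(text, looks_like_text(text))
```

A check like this can only say that a guess is plausible; it cannot confirm that the codes really mean what the guessed encoding says.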

> That is a case of a bad PDF producer (or at least a PDF not
> intended to preserve textual data), and we can’t do anything to extract
> data that does not exist.

But there is text, isn't there? So, can we really not do anything?

