https://bugs.documentfoundation.org/show_bug.cgi?id=161514
Bug ID: 161514
Summary: Invalid Unicode mappings in PDF output for combining diacritics (regression in 24.2)
Product: LibreOffice
Version: 24.2.4.2 release
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: normal
Priority: medium
Component: Writer
Assignee: [email protected]
Reporter: [email protected]
Description:
When exporting documents containing Unicode combining diacritics from Writer
24.2 to PDF, invalid character mappings are generated. As a result, copying
text from the PDF or converting it to text gives incorrect output. The cause
is a mismatch between the text content stream and the Unicode mapping in the
output of 24.2. We'll use the test string "x̌ux̌ux̌ux̌" as an example.
The Unicode mapping itself is probably okay (though different from 7.4.7.2):
it has regrouped the grapheme cluster into a single code, which is probably a
good thing. The relevant parts are:
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<01> <0078030C>
<03> <0075>
endbfchar
Here we see <01> is mapped to U+0078 U+030C, that is, the grapheme x̌, while
<03> is mapped to U+0075, that is, the grapheme u.
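To make that reading of the bfchar section reproducible, here is a minimal Python sketch that parses the excerpt and lists the code points behind each code. The helper name parse_bfchar is mine, and it only handles the simple one-line bfchar form shown here, not bfrange or multi-byte codes.

```python
import re

# CMap excerpt copied from the 24.2 output quoted above.
CMAP_24_2 = """
2 beginbfchar
<01> <0078030C>
<03> <0075>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values are UTF-16BE without a byte order mark.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

mapping = parse_bfchar(CMAP_24_2)
for code, text in sorted(mapping.items()):
    print(f"<{code:02X}> ->", " ".join(f"U+{ord(ch):04X}" for ch in text))
```

Note that no entry for <02> comes out of this, which matters for the text stream shown further down.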
The PDF is internally very different in 24.2 as the text "x̌ux̌ux̌ux̌" has been
tagged as 4 separate spans (though all in the same marked content section).
For each instance of "x̌u" (which is U+0078 U+030C U+0075) we get something like
(note the hex UTF-16-BE in ActualText):
/Span<</ActualText<FEFF0078030C>>>
BDC
56.8 668.1 Td /F1 72 Tf[<01>243<02>]TJ
EMC
1 0 0 1 92.8 668.1 Tm
/F1 72 Tf<03>Tj
The problem is that <02> is not defined in the unicode map, so right away we
get an undefined character (space or tofu) when extracting text, and I'm not at
all sure what 243 is supposed to correspond to.
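As a rough check, the following Python sketch decodes the operand of the TJ operator above against the two bfchar entries from the 24.2 map. The function name decode_show_text and the choice to report unmapped codes as U+FFFD are my own assumptions for illustration; real extractors may substitute something else. As far as I understand the PDF specification, bare numbers inside a TJ array are glyph position adjustments in thousandths of a text space unit, so 243 is presumably just the shift that places the caron over the x, and the sketch skips such numbers.

```python
import re

# ToUnicode entries from the 24.2 CMap quoted above (code -> text).
TO_UNICODE_24_2 = {0x01: "x\u030c", 0x03: "u"}

def decode_show_text(operand, to_unicode):
    """Decode the hex strings of a Tj/TJ operand, one byte per code.

    Bare numbers inside a TJ array are skipped because they are not
    character codes. Codes without a ToUnicode entry become U+FFFD
    and are also returned in a separate list.
    """
    out, missing = [], []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", operand):
        for code in bytes.fromhex(hex_string):
            if code in to_unicode:
                out.append(to_unicode[code])
            else:
                out.append("\ufffd")
                missing.append(code)
    return "".join(out), missing

text, missing = decode_show_text("[<01>243<02>]TJ", TO_UNICODE_24_2)
print(repr(text), [f"<{c:02X}>" for c in missing])
```

The only code reported as missing is <02>, which matches the stray character seen when extracting text.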
This is a regression, as 7.4.7.2 does not show this behaviour. The PDF
internals there are much more straightforward. The cmap contains:
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<01> <0078>
<02> <030C>
<03> <0075>
endbfchar
endcmap
And the text stream is just:
56.8 668.1 Td /F1 72 Tf[<01>243<02>-242<0301>243<02>-242<0301>243<02>-242<0301>243<02>]TJ
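For comparison, the same kind of decoding applied to the 7.4.7.2 stream gives back the original string. This is only a sketch: the decode helper is hypothetical, and it assumes one byte per code (as the <00> <FF> codespace range suggests), so <0301> is read as the two codes <03> and <01>.

```python
import re

# ToUnicode entries and TJ operand from the 7.4.7.2 output quoted above.
TO_UNICODE_7_4 = {0x01: "x", 0x02: "\u030c", 0x03: "u"}
STREAM_7_4 = (
    "[<01>243<02>-242<0301>243<02>-242<0301>243<02>-242<0301>243<02>]TJ"
)

def decode(operand, to_unicode):
    """Map every single-byte code in the hex strings to its Unicode text."""
    codes = [
        code
        for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", operand)
        for code in bytes.fromhex(hex_string)
    ]
    # Unmapped codes would show up as U+FFFD; none are expected here.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

decoded = decode(STREAM_7_4, TO_UNICODE_7_4)
expected = "x\u030cu" * 3 + "x\u030c"
print(decoded == expected)
```

Every code used in the 7.4.7.2 stream has a map entry, which is exactly what the 24.2 output lacks for <02>.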
The problem doesn't seem to be related to Tagged PDF or PDF/A, since I get the
same weird output from 24.2 when I disable them in exporting.
Steps to Reproduce:
1. Create a document with some Unicode combining diacritics, e.g. x̌ux̌ux̌ux̌
(x + U+030C Combining Caron)
2. Export to PDF
3. Copy and paste text from the PDF (or run pdftotext)
Actual Results:
got the output: x̌ ux̌ ux̌ ux̌
(either space or tofu between x̌ and u, corresponding to the missing <02>
character)
Expected Results:
expect the output: x̌ux̌ux̌ux̌
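The comparison in step 3 can be scripted roughly as follows. This is a sketch under assumptions: it relies on poppler's pdftotext being on the PATH, the function names are mine, and the path passed to extract_text is whatever file the export produced.

```python
import subprocess

# The test string from step 1: x + U+030C, then u, ending on x + U+030C.
EXPECTED = "x\u030cu" * 3 + "x\u030c"

def extract_text(pdf_path):
    """Run pdftotext and return what it writes to stdout ("-" = stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def extraction_is_clean(extracted):
    """True if the extracted text is exactly the test string.

    Only leading and trailing whitespace (such as a final newline or
    form feed) is ignored; an inner space or U+FFFD counts as a failure.
    """
    return extracted.strip() == EXPECTED
```

For example, extraction_is_clean(extract_text("test.pdf")) should be True for a correct export and False for the 24.2 output described above, where "test.pdf" is a placeholder file name.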
Reproducible: Always
User Profile Reset: No
Additional Info:
Version: 24.2.3.2 (X86_64) / LibreOffice Community
Build ID: 420(Build:2)
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: en-CA (en_CA.UTF-8); UI: en-US
Debian package version: 4:24.2.3-1~bpo12+1
Calc: threaded