https://bugs.documentfoundation.org/show_bug.cgi?id=171737

            Bug ID: 171737
           Summary: Arabic text extraction from LibreOffice-exported PDF
                    is badly broken with Noto Sans Arabic, especially in
                    vocalized text
           Product: LibreOffice
           Version: 26.2.1.2 release
          Hardware: All
                OS: Linux (All)
            Status: UNCONFIRMED
          Severity: normal
          Priority: medium
         Component: Writer
          Assignee: [email protected]
          Reporter: [email protected]

Description:
When I use `NotoSansArabic-Regular.otf` in LibreOffice Writer and export an
Arabic document to PDF, the PDF renders correctly visually, but text extracted
with `pdftotext` is badly broken.

The extracted text contains extra spaces inside words, broken joining behavior,
and much worse corruption in fully vocalized Arabic text (Arabic with
diacritics / tashkeel). The issue is not primarily a visual rendering problem;
it appears in the PDF text layer or Unicode mapping used for
extraction/search/copy.

Under the same workflow and on the same system, different Arabic fonts behave
differently:

- `Amiri` produces much better extracted text.
- `Noto Naskh Arabic UI` performs better than `Noto Sans Arabic`, but still has
errors in vocalized words.
- `Noto Sans Arabic` is the worst of the three in my tests.

This suggests that LibreOffice PDF export may have a font-dependent issue in
the text layer / `ToUnicode` mapping / cluster handling for Arabic, especially
with diacritics.


Steps to Reproduce:
1. Create a LibreOffice Writer document containing Arabic text with diacritics.
2. Set the text font to `Noto Sans Arabic`.
3. Export the document to PDF from LibreOffice.
4. Run:

```sh
pdftotext output.pdf -
```


Actual Results:
- Extra spaces appear inside Arabic words.
- Word joining is broken in extracted text.
- Fully vocalized Arabic text is much more corrupted than unvocalized text.
- The PDF still looks visually correct when viewed normally.

Expected Results:
Arabic text extracted from a LibreOffice-exported PDF should preserve words
correctly, without inserted spaces or broken joining, and should remain
reasonably faithful to the source text, including vocalized Arabic.
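
One rough way to quantify "reasonably faithful" is the fraction of source words that survive extraction intact. The sketch below is a minimal illustration, not an established metric. The helper name `intact_word_ratio` is hypothetical, and the use of NFKC normalization (so that presentation forms and differently ordered marks still compare equal) is an assumption about what a fair comparison should tolerate.

```python
import unicodedata


def intact_word_ratio(source: str, extracted: str) -> float:
    """Fraction of whitespace-separated source words found whole in extracted.

    Both strings are NFKC-normalized first (an assumption, see lead-in), so
    Arabic presentation forms are folded back to their base letters.
    """
    source_words = unicodedata.normalize("NFKC", source).split()
    if not source_words:
        return 1.0  # nothing to lose
    extracted_words = set(unicodedata.normalize("NFKC", extracted).split())
    intact = sum(1 for word in source_words if word in extracted_words)
    return intact / len(source_words)


# Two short words; in the second call a space is injected inside the first.
source = "\u0647\u0648 \u0645\u0639"
print(intact_word_ratio(source, source))
print(intact_word_ratio(source, "\u0647 \u0648 \u0645\u0639"))
```

A ratio well below 1.0 for one font and close to 1.0 for another, on identical source text, would put a number on the font-dependent difference described in this report.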



Reproducible: Always


User Profile Reset: No

Additional Info:
### Font

- Font: `NotoSansArabic-Regular.ttf` / `NotoSansArabic-Regular.otf`
- Font version: `2.013`
- Source:
<https://github.com/notofonts/arabic/releases/download/NotoSansArabic-v2.013/NotoSansArabic-v2.013.zip>
- Download date: `2026-04-10`

### Operating System

- `chimera linux`

### LibreOffice Version

```text
LibreOffice Version: 26.2.1.2 (X86_64)
Build ID: 620(Build:2)
CPU threads: 4; OS: Linux 6.19; UI render: default; VCL: gtk3
Locale: en-GB (en_GB.UTF-8); UI: en-US
Calc: threaded
```

### Poppler / pdftotext

- `poppler version: 26.02.0-r0`
- `pdftotext version: 26.02.0`

### Comparison with Other Fonts

Using the same Arabic text and the same LibreOffice-to-PDF workflow:

- `Amiri` gave the best `pdftotext` result.
- `Noto Naskh Arabic UI` was better than `Noto Sans Arabic`, but still
problematic with diacritics.
- `Noto Sans Arabic` produced the worst extracted text.

This comparison seems important because the same application, same export path,
and same extraction tool produce meaningfully different results depending on
the font.

### Sample Text

Unvocalized sample:

```text
والإيمان هو التصديق الجازم بالله وصفاته والكتب والرسل واليوم الآخر والقدر، مع
الإذعان والخضوع والتسليم لذلك، ففي الحديث عن رسول الله ﷺ أن الإيمان «أن تؤمن
بالله وملائكته وكتبه ورسله واليوم الآخر وتؤمن بالقدر خيره وشره» [رواه مسلم]،
و«ما أصابك لم يكن ليخطئك وما أخطأك لم يكن ليصيبك» [رواه أبو داود].
```

Vocalized sample:

```text
وَالإِيمَانُ هُوَ التَّصْدِيقُ الْجَازِمُ بِاللَّهِ وَصِفَاتِهِ وَالْكُتُبِ 
وَالرُّسُلِ وَالْيَوْمِ الآخِرِ وَالْقَدَرِ، مَعَ
الإِذْعَانِ وَالْخُضُوعِ وَالتَّسْلِيمِ لِذَلِكَ.
```

Problematic sample word:

```text
الإيمان
U+0627 U+0644 U+0625 U+064A U+0645 U+0627 U+0646
```
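
To compare extracted text against the source at the code-point level, a dump like the one above can be produced with a few lines of Python. This is only a sketch; the helper name `codepoints` is hypothetical and not part of any tool mentioned in this report.

```python
def codepoints(text: str) -> str:
    """Return the code points of text as space-separated U+XXXX labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The sample word, written with escapes so the input is unambiguous.
word = "\u0627\u0644\u0625\u064A\u0645\u0627\u0646"
print(codepoints(word))
```

Running the same helper on the corresponding word from the `pdftotext` output would show where spaces (U+0020) or reordered marks were introduced.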

### Example of Bad Extraction

For the vocalized sample above, `pdftotext` output with `Noto Sans Arabic`
follows this pattern:

```text
َو الِإ يَم اُن ُه َو الَّتْص ِديُق اْلَجاِزُم ِبالَّلِه ...
```

Inserted spaces appear inside words, and combining marks are separated in ways
that make the extracted text unsuitable for reliable search, copy/paste, or
downstream processing.
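
One symptom in the output above, a combining mark that appears at the start of a token rather than after its base letter, can be counted mechanically. The sketch below is a heuristic under stated assumptions. The helper name `orphaned_marks` is hypothetical, and it only detects marks that directly follow whitespace or start the text, so it does not catch every kind of corruption described here.

```python
import unicodedata


def orphaned_marks(text: str) -> int:
    """Count combining marks with no base letter directly before them.

    A mark is counted when it is the first character of the text or when
    the previous character is whitespace.
    """
    count = 0
    prev = " "  # treat the start of the text like whitespace
    for ch in text:
        if unicodedata.combining(ch) != 0 and prev.isspace():
            count += 1
        prev = ch
    return count


# Well-formed: waw + fatha + alef + lam (mark follows its base letter).
good = "\u0648\u064E\u0627\u0644"
# Corrupted: each token starts with a mark (fatha + waw, damma + heh).
bad = "\u064E\u0648 \u064F\u0647"
print(orphaned_marks(good), orphaned_marks(bad))
```

Correctly written Arabic text should score zero, so any non-zero count on extracted text indicates displaced marks or inserted spaces.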

### Technical Note

In local comparison tests, `Noto Sans Arabic` was embedded in the exported PDF
as a `Type 1` font, while `Amiri` was embedded as `TrueType`. I do not know
whether this is the root cause, but it may help
narrow the issue to LibreOffice's PDF export path, font embedding mode, or
`ToUnicode` mapping for Arabic shaping clusters and diacritics.

I am not claiming that LibreOffice is solely at fault, because the severity
appears to vary by font. However, since the problem appears in the exported PDF
text extraction layer rather than in visual rendering, LibreOffice PDF export
seems to be a key part of the issue.
