https://bugs.documentfoundation.org/show_bug.cgi?id=172476

            Bug ID: 172476
           Summary: Bugs in markdown import/export, especially pertaining
                    to serialization
           Product: LibreOffice
           Version: 26.2.4.2 release
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: medium
         Component: Writer
          Assignee: [email protected]
          Reporter: [email protected]

Description:
In the interest of full disclosure, this bug was reproduced with the assistance
of AI.

OVERVIEW
--------
When a document is exported to Markdown, the serializer escapes "special"
characters (_ > # < | etc.) without tracking whether it is inside an inline
code span or a fenced code block. The CommonMark specification (section 2.4,
Backslash escapes) states that backslash escapes do NOT work in code spans,
code blocks, autolinks, or raw HTML. Inserting them there is therefore not a
fidelity nicety — it produces output that renders the backslashes literally,
corrupting every code-formatted identifier and every fenced code block.

The same escaper is also non-idempotent: in at least one path it escapes a
character and then escapes the backslash it just added, yielding doubled
sequences (\\| from an input of \|).

The Markdown export filter is LibreOffice's own serializer (the MD4C library
is used only for import), so this is in the Writer/sw export code, not in
MD4C.

STEPS TO REPRODUCE
------------------
1. Open the attached file libreoffice-markdown-roundtrip-reproducer.md in
   Writer (File > Open).
2. Export it back to Markdown (File > Save a Copy / Save As, Markdown).
3. Compare the exported file with the original (e.g. diff).
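
For step 3, a minimal sketch of the comparison using Python's standard difflib module. The function name and the file labels original.md and exported.md are placeholders chosen for illustration, not names used by LibreOffice.

```python
import difflib


def roundtrip_diff(original: str, exported: str) -> list[str]:
    """Return unified-diff lines between the original and exported Markdown."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            exported.splitlines(),
            fromfile="original.md",   # placeholder label
            tofile="exported.md",     # placeholder label
            lineterm="",
        )
    )


# One line from Test 1 before and after the round trip.
original = "`BUS_EN` drives the gate.\n"
exported = "`BUS\\_EN` drives the gate.\n"
for line in roundtrip_diff(original, exported):
    print(line)
```

An empty result means the two texts are identical line for line; any "+" line carrying a backslash inside backticks points at Defect 1 or 2.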

ENVIRONMENT
-----------
LibreOffice 26.2.x, Build ID: <paste from Help > About>
OS: <your OS>
Locale / UI: <your locale>

------------------------------------------------------------
DEFECT 1 — escaping inside inline code spans
------------------------------------------------------------
INPUT (inside backticks):   `BUS_EN`   `count > 0`
ACTUAL OUTPUT:              `BUS\_EN`   `count \> 0`
EXPECTED OUTPUT:            `BUS_EN`    `count > 0`   (unchanged)

Why this is wrong: inside a code span, \_ and \> are two literal characters
each; a renderer shows the backslash. Any identifier containing an underscore
(very common in code) gains a visible backslash.

------------------------------------------------------------
DEFECT 2 — escaping inside fenced code blocks (plus trailing spaces)
------------------------------------------------------------
INPUT (inside a ``` fenced block):
    > STATUS
    # this is a shell comment, not a heading
    device_id = PHASE_A
    path = firmware\libraries\src
    newline escape is \n

ACTUAL OUTPUT:
    \> STATUS
    \# this is a shell comment, not a heading
    device\_id = PHASE\_A
    path = firmware\\libraries\\src
    newline escape is \\n
    (and two trailing spaces appended to each line)

EXPECTED OUTPUT: every character verbatim, no escaping, no trailing spaces.

Why this is wrong: fenced code blocks are verbatim by definition. The
backslash doubling actively corrupts content — a Windows path
firmware\libraries\src becomes firmware\\libraries\\src.
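
A minimal sketch of verbatim fenced-block output, in Python for illustration only (this is not LibreOffice code, and emit_fenced_block is a hypothetical name). It assumes the serializer already holds the block's text as one string. Nothing is escaped, no trailing spaces are added, and the fence is made longer than any backtick run in the content so the block cannot be closed early.

```python
import re


def emit_fenced_block(code: str, info: str = "") -> str:
    """Serialize code verbatim inside a backtick fence."""
    # Longest run of backticks in the content (0 if there is none).
    longest_run = max((len(run) for run in re.findall(r"`+", code)), default=0)
    # CommonMark requires at least three backticks; use more if the
    # content itself contains a run of three or more.
    fence = "`" * max(3, longest_run + 1)
    # The content is copied unchanged; only a final newline is ensured.
    body = code if code.endswith("\n") else code + "\n"
    return f"{fence}{info}\n{body}{fence}\n"


block = "> STATUS\ndevice_id = PHASE_A\npath = firmware\\libraries\\src\n"
print(emit_fenced_block(block), end="")
```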

------------------------------------------------------------
DEFECT 4 — doubled, non-idempotent escaping in table cells
------------------------------------------------------------
INPUT (a code span in a table cell):  `MODE <HALT\|HOLD\|FLIP>`
ACTUAL OUTPUT:                         `MODE \<HALT\\|HOLD\\|FLIP\>`
EXPECTED OUTPUT:                       `MODE <HALT\|HOLD\|FLIP>`   (unchanged)

Two problems: (a) the pipe escape \| was re-escaped to \\| — the serializer
escaped the backslash it had just introduced (non-idempotent); (b) the angle
brackets were escaped inside a code span. The \\| sequence can render as a
literal backslash and, depending on the parser, split the table cell.

ROOT-CAUSE NOTE (for Defects 1, 2, and 4)
-----------------------------------------
A correct Markdown serializer must be context-aware: escaping must be
suppressed inside code spans and code blocks, and the escaping pass must be
idempotent (never escape a backslash it just emitted). The current behaviour
suggests a single global character-escaping pass that runs regardless of
inline context, plus a second pass that re-escapes backslashes. This is a
well-known failure mode in other Markdown writers (e.g. Pandoc and Zotero
have both had "escaping too aggressive" reports), so there is prior art for
the fix.
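
A minimal sketch of context-aware escaping, in Python for illustration only (not LibreOffice code). It assumes the document model exposes inline content as (text, is_code) runs holding unescaped text. The names SPECIAL, escape_text, serialize_inline and the in_table_cell flag are hypothetical, and the set of escaped characters is an example rather than a complete list. Each character is visited once, so a backslash emitted by the escaper is never escaped again.

```python
# Example set of characters escaped in ordinary text (not exhaustive).
SPECIAL = set("\\`*_{}[]<>#|~")


def escape_text(text: str) -> str:
    """Escape each special character exactly once, in a single pass."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


def serialize_inline(runs: list[tuple[str, bool]], in_table_cell: bool = False) -> str:
    """Serialize (text, is_code) runs; code runs are not escaped."""
    out = []
    for text, is_code in runs:
        if is_code:
            # Inside a table cell a pipe must still be written as \| even
            # in a code span, otherwise it would end the cell.
            if in_table_cell:
                text = text.replace("|", "\\|")
            out.append("`" + text + "`")  # assumes text contains no backticks
        else:
            out.append(escape_text(text))
    return "".join(out)


print(serialize_inline([("BUS_EN", True), (" drives gate_a", False)]))
print(serialize_inline([("MODE <HALT|HOLD|FLIP>", True)], in_table_cell=True))
```

The first call leaves the code span untouched and escapes only the underscore in the prose run; the second reproduces the expected output of Defect 4.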

================================================================
ADDITIONAL CONFIRMED DEFECTS (suggest filing as separate bugs)
================================================================

DEFECT A — two literal tildes read as strikethrough (import + export)
INPUT cell:   1=~1.6 V, 2=~3.46 V
ACTUAL:       1=~~1.6 V, 2=~~3.46 V   (a ~~...~~ strikethrough span)
CONTROL:      a lone tilde (about ~1.6 V nominal) is left unchanged
Mechanism: MD4C's strikethrough extension treats a single ~ as a delimiter,
so two pairable tildes import as a strikethrough run, which export then
writes as ~~...~~. This is not table-specific — pairable tildes in ordinary
prose are affected too. It is lossy for technical text that uses ~ to mean
"approximately." Possible fixes: do not enable single-tilde strikethrough on
import, or escape/preserve tildes that were literal on export.
(Component: Writer. This one is partly import-side.)

DEFECT B — column alignment invented for inline-code headers
INPUT separator row:   | - | - |        (no alignment specified)
Header row:            | `CODEHDR` | plain header |
ACTUAL separator row:  | :-: | - |       (column 1 centered)
Mechanism: a header cell whose content is inline code causes that column to
be exported as center-aligned. Alignment the author never specified is
introduced. Plain-text headers are unaffected (stay left).
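
A minimal sketch of the expected behaviour, in Python for illustration only (not LibreOffice code): the separator row is derived solely from each column's explicit alignment, never from the formatting of the header cell. The names SEPARATORS and separator_row, and the use of None for "no alignment specified", are assumptions made for this example.

```python
from typing import Optional

# GitHub-style table separators for each explicit alignment.
SEPARATORS = {None: "-", "left": ":-", "center": ":-:", "right": "-:"}


def separator_row(alignments: list[Optional[str]]) -> str:
    """Build the delimiter row from per-column alignment only."""
    cells = [SEPARATORS[alignment] for alignment in alignments]
    return "| " + " | ".join(cells) + " |"


# Two columns with no alignment specified, as in the input above.
print(separator_row([None, None]))  # | - | - |
```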

DEFECT C — thematic breaks dropped
INPUT:  --- on its own line (thematic break) between sections
ACTUAL: removed entirely, collapsed to a blank line
Thematic breaks are valid CommonMark block elements and should survive
export.

DEFECT D — tight lists converted to loose lists
INPUT:  a bullet list with no blank lines between items
ACTUAL: a blank line inserted between every item (loose list)
Changes rendered spacing; the tight/loose distinction is not preserved.
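
A minimal sketch of preserving the distinction, in Python for illustration only (not LibreOffice code). It assumes the importer records a per-list tight flag; serialize_bullets and that flag are hypothetical names.

```python
def serialize_bullets(items: list[str], tight: bool = True) -> str:
    """Write a bullet list; blank lines between items only if it is loose."""
    separator = "\n" if tight else "\n\n"
    return separator.join("- " + item for item in items) + "\n"


print(serialize_bullets(["first item", "second item", "third item"]), end="")
```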

DEFECT E — nested table and continuation paragraph flattened out of a list
INPUT:  a bullet with an indented table and an indented continuation
        paragraph belonging to it
ACTUAL: the table is promoted to top level, and the continuation paragraph
        becomes a new top-level bullet of its own
List-item nesting (block content owned by a list item) is lost.
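
A minimal sketch of keeping block content inside its list item, in Python for illustration only (not LibreOffice code). It assumes each child block (table, paragraph) has already been serialized to text; serialize_list_item is a hypothetical name. Child blocks are indented by the width of the "- " marker so that a CommonMark parser treats them as part of the item.

```python
def serialize_list_item(first_line: str, child_blocks: list[str]) -> str:
    """Write a bullet followed by its nested blocks, indented two spaces."""
    indent = "  "  # width of the "- " list marker
    parts = ["- " + first_line]
    for block in child_blocks:
        parts.append("")  # blank line before each nested block
        parts.extend(indent + line if line else "" for line in block.splitlines())
    return "\n".join(parts) + "\n"


table = "| Key | Value |\n| - | - |\n| a | 1 |\n| b | 2 |"
paragraph = "This sentence is a continuation paragraph of the same bullet."
print(serialize_list_item("Outer bullet introducing a table.", [table, paragraph]), end="")
```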

================================================================
NOT A DEFECT (noted to pre-empt mis-triage)
================================================================
Hard-wrapped paragraphs (manual newlines) are reflowed into a single long
line on export. This is valid CommonMark and renders identically; it is only
a diff-noise inconvenience, not a correctness problem.


Steps to Reproduce:
1. Open the Markdown file given under "Expected Results" in LibreOffice Writer.
2. Save it as a Markdown file. The text under "Actual Results" is the content
of that file after passing through LibreOffice Writer.



Actual Results:
# LibreOffice Markdown Round-Trip Bug Reproducer

This file is a minimal test fixture for reproducing several defects
in LibreOffice Writer's Markdown **export** filter.

How to use it: open this `.md` in LibreOffice Writer, then save / export
it back out to a *new* `.md` file, and compare the two (e.g. `diff` them).
Each numbered section isolates one defect and states what correct output
looks like versus the buggy output to watch for.

Context: the Markdown import/export feature introduced in LibreOffice 26.2. The
dialect is CommonMark plus GitHub-style tables. Import uses the MD4C parser;
export uses LibreOffice's own serializer, which is where these defects live.

Note: the explanatory prose in this file will itself be transformed by
the round-trip (that is expected and harmless). The "EXPECTED / BUG"
lines describe what to inspect in each test's actual content.


## Test 1: Escaping inside inline code spans

Per the CommonMark spec, backslash escapes are inert inside code spans, so
a serializer must NOT insert them there. The identifiers below
contain underscores and other "special" characters wrapped in backticks:

`BUS\_EN` drives the gate. See also `bias\_low`, `RAIL\_HIGH`, `PHASE\_A`,
the filename `board\_common.h`, and a comparison token `count \> 0`.

EXPECTED (correct): each backticked token round-trips unchanged,
e.g. `BUS\_EN`.

BUG TO LOOK FOR: it returns as `BUS\\\_EN` with a literal backslash.
Because escapes do not process inside code, that backslash renders visibly.


## Test 2: Escaping inside fenced code blocks

Same rule as code spans: nothing inside a fenced block should be escaped. This
block deliberately contains `\>`, `\#`, underscores, and backslashes:

```
\> STATUS  
\# this is a shell comment, not a heading  
device\_id = PHASE\_A  
path = firmware\\libraries\\src  
newline escape is \\n
```

EXPECTED (correct): every character above survives verbatim.

BUG TO LOOK FOR: prompts become `\\\>`, comments become `\\\#`,
identifiers become `device\\\_id`, single backslashes double to `\\\\`, and two
trailing spaces may be appended to each line.


## Test 3: Tildes paired into accidental strikethrough

The real trigger is two single tildes in the SAME cell. MD4C's
strikethrough extension treats a single tilde as a delimiter, so on import it
pairs the two tildes into a strikethrough run; on export LibreOffice writes
that run back as a doubled-tilde span. A lone tilde with no partner is left
alone, so the table below has a control row (one tilde) and a trigger row (two
tildes, where the second tilde is flanked on its left by a non-space
character, mirroring the original 1=~~1.6 V, 2=~~3.46 V text).

| Case | Cell contents |
| - | - |
| lone tilde (control) | about ~1.6 V nominal |
| two tildes (trigger) | 1=~~1.6 V, 2=~~3.46 V |


EXPECTED (correct): both cells round-trip verbatim; no tilde is doubled.

BUG TO LOOK FOR: the control row stays unchanged, but the trigger cell
comes back with a doubled-tilde span wrapping the text between the two tildes
(the "1.6 V, 2=" portion), so it renders as strikethrough. This is really
an import-side single-tilde-strikethrough interpretation surfacing on
export, not a pure serializer-escaping bug like Tests 1, 2, and 4.


## Test 4: Escaped pipe and angle brackets inside a table cell

A literal pipe inside a cell is written escaped as a backslash-pipe in
the source. The exporter should preserve it as a single-backslash escape. It
has been observed to double the backslash and to additionally escape the
angle brackets of a token.

| Command | Note |
| - | - |
| `MODE \<HALT\\|HOLD\\|FLIP\>` | choose one |


EXPECTED (correct): the cell round-trips as the original `MODE
\<HALT\\|HOLD\\|FLIP\>`.

BUG TO LOOK FOR: it returns with doubled pipe escapes and escaped
angle brackets, which can corrupt or split the cell on render.


## Test 5: Table column alignment (hypothesis probe)

In the original document, only some tables gained invented center
alignment, and the columns affected were the ones whose HEADER cell contained
inline code (backticks). A generic all-plain table did not trigger it. This
test isolates that hypothesis: column 1 has an inline-code header, column 2 has
a plain-text header, and all body cells are plain so the header is the
only variable. No alignment is specified, so both columns should default to
left.

| `CODEHDR` | plain header |
| :-: | - |
| value one | value two |
| value three | value four |


EXPECTED (correct): output keeps plain dash separators for both columns.

BUG TO LOOK FOR (to confirm the hypothesis): column 1 comes back
center- aligned (colon dash colon) while column 2 stays left. If neither
column changes, the inline-code-header theory is wrong and this defect needs
a different reproducer (or may depend on Writer table-style settings
rather than being a deterministic serializer bug).


## Test 6: Thematic breaks (horizontal rules)

The dash-rule lines separating every section in this file are thematic breaks.

EXPECTED (correct): they survive as horizontal rules.

BUG TO LOOK FOR: they vanish entirely, collapsed into blank lines.


## Test 7: Tight list stays tight

The list below has no blank lines between items (a "tight" list):

- first item

- second item

- third item

EXPECTED (correct): stays tight, with no blank lines inserted.

BUG TO LOOK FOR: a blank line is inserted between every item, turning it into a
"loose" list that renders with extra vertical spacing.


## Test 8: List item with a continuation paragraph and a nested table

The bullet below owns an indented continuation paragraph AND an indented table.
Both should remain nested under the bullet.

- Outer bullet introducing a table.

| Key | Value |
| - | - |
| a | 1 |
| b | 2 |


- This sentence is a continuation paragraph of the same bullet.

EXPECTED (correct): the table and the sentence stay nested inside the bullet.

BUG TO LOOK FOR: the table is promoted to the top level, and/or
the continuation sentence becomes a new top-level bullet of its own.


## Test 9: Hard-wrapped paragraph reflow

The next paragraph is manually hard-wrapped at a narrow width using
real newlines, the way many hand-edited Markdown files are written.

This is a hard-wrapped paragraph. Each of these lines ends with a real newline
in the source rather than flowing out to the margin. Markdown joins them into
one rendered paragraph, which is correct, but the source line structure is
itself information when diffing.

EXPECTED (acceptable either way): joining the lines into one long line is valid
Markdown and renders identically.

NOTE: LibreOffice collapses the wrap to a single long line. This is not
a correctness bug, but it makes a git diff of an edited file enormous
and line-level review impractical. Listed for completeness.


## Summary of expected defects

1. Backslash escapes inserted inside inline code spans (Test 1).

2. Backslash escapes inserted inside fenced code blocks, plus possible trailing
spaces (Test 2).

3. Two tildes in one cell paired into a strikethrough span on import
and re-emitted as a doubled-tilde span on export (Test 3).

4. Pipe escape doubled and angle brackets escaped inside table cells (Test 4).

5. Column alignment possibly invented for inline-code-header columns (Test 5,
hypothesis to confirm).

6. Thematic breaks dropped (Test 6).

7. Tight lists converted to loose lists (Test 7).

8. Nested table and continuation paragraph flattened out of their list
item (Test 8).

9. Hard-wrapped paragraphs reflowed to single lines (Test 9, cosmetic).



Expected Results:
# LibreOffice Markdown Round-Trip Bug Reproducer

This file is a minimal test fixture for reproducing several defects in
LibreOffice Writer's Markdown **export** filter.

How to use it: open this `.md` in LibreOffice Writer, then save / export it
back out to a *new* `.md` file, and compare the two (e.g. `diff` them). Each
numbered section isolates one defect and states what correct output looks
like versus the buggy output to watch for.

Context: the Markdown import/export feature introduced in LibreOffice 26.2.
The dialect is CommonMark plus GitHub-style tables. Import uses the MD4C
parser; export uses LibreOffice's own serializer, which is where these
defects live.

Note: the explanatory prose in this file will itself be transformed by the
round-trip (that is expected and harmless). The "EXPECTED / BUG" lines
describe what to inspect in each test's actual content.

---

## Test 1: Escaping inside inline code spans

Per the CommonMark spec, backslash escapes are inert inside code spans, so a
serializer must NOT insert them there. The identifiers below contain
underscores and other "special" characters wrapped in backticks:

`BUS_EN` drives the gate. See also `bias_low`, `RAIL_HIGH`, `PHASE_A`, the
filename `board_common.h`, and a comparison token `count > 0`.

EXPECTED (correct): each backticked token round-trips unchanged, e.g.
`BUS_EN`.

BUG TO LOOK FOR: it returns as `BUS\_EN` with a literal backslash. Because
escapes do not process inside code, that backslash renders visibly.

---

## Test 2: Escaping inside fenced code blocks

Same rule as code spans: nothing inside a fenced block should be escaped.
This block deliberately contains `>`, `#`, underscores, and backslashes:

```
> STATUS
# this is a shell comment, not a heading
device_id = PHASE_A
path = firmware\libraries\src
newline escape is \n
```

EXPECTED (correct): every character above survives verbatim.

BUG TO LOOK FOR: prompts become `\>`, comments become `\#`, identifiers
become `device\_id`, single backslashes double to `\\`, and two trailing
spaces may be appended to each line.

---

## Test 3: Tildes paired into accidental strikethrough

The real trigger is two single tildes in the SAME cell. MD4C's strikethrough
extension treats a single tilde as a delimiter, so on import it pairs the two
tildes into a strikethrough run; on export LibreOffice writes that run back
as a doubled-tilde span. A lone tilde with no partner is left alone, so the
table below has a control row (one tilde) and a trigger row (two tildes,
where the second tilde is flanked on its left by a non-space character,
mirroring the original 1=~1.6 V, 2=~3.46 V text).

| Case | Cell contents |
| - | - |
| lone tilde (control) | about ~1.6 V nominal |
| two tildes (trigger) | 1=~1.6 V, 2=~3.46 V |

EXPECTED (correct): both cells round-trip verbatim; no tilde is doubled.

BUG TO LOOK FOR: the control row stays unchanged, but the trigger cell comes
back with a doubled-tilde span wrapping the text between the two tildes (the
"1.6 V, 2=" portion), so it renders as strikethrough. This is really an
import-side single-tilde-strikethrough interpretation surfacing on export,
not a pure serializer-escaping bug like Tests 1, 2, and 4.

---

## Test 4: Escaped pipe and angle brackets inside a table cell

A literal pipe inside a cell is written escaped as a backslash-pipe in the
source. The exporter should preserve it as a single-backslash escape. It has
been observed to double the backslash and to additionally escape the angle
brackets of a token.

| Command | Note |
| - | - |
| `MODE <HALT\|HOLD\|FLIP>` | choose one |

EXPECTED (correct): the cell round-trips as the original
`MODE <HALT\|HOLD\|FLIP>`.

BUG TO LOOK FOR: it returns with doubled pipe escapes and escaped angle
brackets, which can corrupt or split the cell on render.

---

## Test 5: Table column alignment (hypothesis probe)

In the original document, only some tables gained invented center alignment,
and the columns affected were the ones whose HEADER cell contained inline
code (backticks). A generic all-plain table did not trigger it. This test
isolates that hypothesis: column 1 has an inline-code header, column 2 has a
plain-text header, and all body cells are plain so the header is the only
variable. No alignment is specified, so both columns should default to left.

| `CODEHDR` | plain header |
| - | - |
| value one | value two |
| value three | value four |

EXPECTED (correct): output keeps plain dash separators for both columns.

BUG TO LOOK FOR (to confirm the hypothesis): column 1 comes back center-
aligned (colon dash colon) while column 2 stays left. If neither column
changes, the inline-code-header theory is wrong and this defect needs a
different reproducer (or may depend on Writer table-style settings rather
than being a deterministic serializer bug).

---

## Test 6: Thematic breaks (horizontal rules)

The dash-rule lines separating every section in this file are thematic
breaks.

EXPECTED (correct): they survive as horizontal rules.

BUG TO LOOK FOR: they vanish entirely, collapsed into blank lines.

---

## Test 7: Tight list stays tight

The list below has no blank lines between items (a "tight" list):

- first item
- second item
- third item

EXPECTED (correct): stays tight, with no blank lines inserted.

BUG TO LOOK FOR: a blank line is inserted between every item, turning it into
a "loose" list that renders with extra vertical spacing.

---

## Test 8: List item with a continuation paragraph and a nested table

The bullet below owns an indented continuation paragraph AND an indented
table. Both should remain nested under the bullet.

- Outer bullet introducing a table.

  | Key | Value |
  | - | - |
  | a | 1 |
  | b | 2 |

  This sentence is a continuation paragraph of the same bullet.

EXPECTED (correct): the table and the sentence stay nested inside the bullet.

BUG TO LOOK FOR: the table is promoted to the top level, and/or the
continuation sentence becomes a new top-level bullet of its own.

---

## Test 9: Hard-wrapped paragraph reflow

The next paragraph is manually hard-wrapped at a narrow width using real
newlines, the way many hand-edited Markdown files are written.

This is a hard-wrapped paragraph. Each of these lines ends with a
real newline in the source rather than flowing out to the margin.
Markdown joins them into one rendered paragraph, which is correct,
but the source line structure is itself information when diffing.

EXPECTED (acceptable either way): joining the lines into one long line is
valid Markdown and renders identically.

NOTE: LibreOffice collapses the wrap to a single long line. This is not a
correctness bug, but it makes a git diff of an edited file enormous and
line-level review impractical. Listed for completeness.

---

## Summary of expected defects

1. Backslash escapes inserted inside inline code spans (Test 1).
2. Backslash escapes inserted inside fenced code blocks, plus possible
   trailing spaces (Test 2).
3. Two tildes in one cell paired into a strikethrough span on import and
   re-emitted as a doubled-tilde span on export (Test 3).
4. Pipe escape doubled and angle brackets escaped inside table cells
   (Test 4).
5. Column alignment possibly invented for inline-code-header columns
   (Test 5, hypothesis to confirm).
6. Thematic breaks dropped (Test 6).
7. Tight lists converted to loose lists (Test 7).
8. Nested table and continuation paragraph flattened out of their list item
   (Test 8).
9. Hard-wrapped paragraphs reflowed to single lines (Test 9, cosmetic).



Reproducible: Always


User Profile Reset: No

Additional Info:
The original Markdown file shows the formatting that the LibreOffice
import/export round trip should preserve.
