https://bugs.documentfoundation.org/show_bug.cgi?id=169103
Bug ID: 169103
Summary: Add option to treat similar characters as equivalent
(spaces, hyphens, quotes) in Find & Replace across all
applications
Product: LibreOffice
Version: Inherited From OOo
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: enhancement
Priority: medium
Component: framework
Assignee: [email protected]
Reporter: [email protected]
Currently, LibreOffice cannot find non-breaking hyphens '‑' (U+2011, entered
with Ctrl+Shift+-) when searching with a regular hyphen '-' (U+002D).
USE CASE: In scientific documents, I replaced all hyphens in a-C abbreviations
with non-breaking hyphens (U+2011) to prevent line breaks. However, this makes
searching difficult — typing a regular hyphen in the Find dialog does not match
non-breaking hyphens.
Note: a-C stands for amorphous carbon:
https://en.wikipedia.org/wiki/Amorphous_carbon
https://www.sciencedirect.com/topics/materials-science/amorphous-carbon
COMPARISON WITH MS WORD:
Microsoft Word treats these characters as equivalent in Find operations:
U+002D - Hyphen-Minus (regular)
U+2011 ‑ Non-Breaking Hyphen
When searching for '-', Word automatically finds both variants.
The same problem exists with space variants and quotation mark variants.
ENHANCEMENT:
Add a character normalization option to Find & Replace dialogs.
Proposed checkbox label (choose one):
☑ Treat similar characters as equivalent
☑ Match character variants
☑ Ignore punctuation differences
Note: The label should clearly indicate this is separate from diacritic/accent
handling.
BEHAVIOR WHEN ENABLED:
- All space variants → regular space (U+0020)
- All hyphen/dash variants → hyphen-minus (U+002D)
- All quotation mark variants → standard quotes (U+0022, U+0027)
EXAMPLES:
- Searching for "a-C" matches: a-C, a‑C, a–C, a—C
- Searching for "hello world" matches text with any space variant
- Searching for "test" matches: "test", “test”, «test», 『test』
BEHAVIOR WHEN DISABLED (default):
- Only exact character matches (current behavior)
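The enabled/disabled behavior above can be sketched as a one-to-one folding table applied to both the search term and the searched text. This is only a minimal Python illustration under stated assumptions, not LibreOffice code: the names `FOLD`, `fold_variants` and `find_all` are hypothetical, and the character sets are a subset of the lists given later in this report.

```python
# Hypothetical folding table; character sets are illustrative, not exhaustive.
SPACES = ("\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005"
          "\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000")
HYPHENS = "\u2010\u2011\u2012\u2013\u2014\ufe58\ufe63\uff0d"
SINGLE_QUOTES = "\u2018\u2019\u201a\u201b\u2039\u203a\uff07"
DOUBLE_QUOTES = "\u00ab\u00bb\u201c\u201d\u201e\u201f\uff02\u300e\u300f"

# Map every variant code point to its plain ASCII counterpart.
FOLD = {}
for chars, target in ((SPACES, " "), (HYPHENS, "-"),
                      (SINGLE_QUOTES, "'"), (DOUBLE_QUOTES, '"')):
    for ch in chars:
        FOLD[ord(ch)] = target


def fold_variants(text):
    """Replace each listed variant with its ASCII counterpart (1:1 mapping)."""
    return text.translate(FOLD)


def find_all(needle, haystack, match_variants=False):
    """Return the start offsets of every match of needle in haystack."""
    if match_variants:
        needle, haystack = fold_variants(needle), fold_variants(haystack)
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


text = "a-C a\u2011C a\u2013C a\u2014C"
print(find_all("a-C", text))                       # option disabled
print(find_all("a-C", text, match_variants=True))  # option enabled
```

Because every variant maps to exactly one replacement character, offsets found in the folded text are also valid offsets in the original text, which keeps highlighting and Replace straightforward in this sketch.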
OPEN QUESTION: How should primes and accents be handled?
Primes (′, ″, ‴):
- Used in mathematical/scientific notation: f′(x), 5′ 10″
- Currently often confused with quotation marks
- Recommendation: Keep primes separate to preserve mathematical meaning
Accents (`, ´, ^):
- Only partly covered by Unicode NFKC normalization (U+00B4 decomposes to
  space + combining acute; U+0060 and U+005E are unchanged)
- Can be standalone (spacing) or combining characters
- Should these be included in character variant matching?
See: [3] https://en.wikipedia.org/wiki/Prime_(symbol)
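To make the open question more concrete, the following small Python check shows what NFKC actually does to a few of the spacing accents and primes mentioned above. It uses only the standard `unicodedata` module; the helper name `nfkc_codepoints` is introduced here for illustration.

```python
import unicodedata


def nfkc_codepoints(ch):
    """Return the NFKC form of ch as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", ch)]


# Grave, acute, circumflex, prime, double prime.
for ch in ("\u0060", "\u00b4", "\u005e", "\u2032", "\u2033"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"{' '.join(nfkc_codepoints(ch))}")
```

The grave accent, circumflex and single prime come back unchanged, the spacing acute accent becomes a space followed by a combining acute, and the double prime becomes two single primes. None of them is turned into a quotation mark, so NFKC by itself neither merges primes with quotes nor settles how accents should be treated.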
To frame the discussion, I suggest the following schema to illustrate the scope
of the topic:
Text Canonicalization
└─ Character Folding
├─ Unicode Normalization (NFKC, NFKD)
├─ Case Folding
├─ Accent/Diacritic Folding
└─ Character Class Normalization
├─ Whitespace normalization
├─ Hyphen/dash normalization
└─ Quote normalization
┌──────┬─────────────┬───────────────┬──────────────────────────┐
│ Form │ Decomposed? │ Compatibility?│ Use case │
├──────┼─────────────┼───────────────┼──────────────────────────┤
│ NFC │ no │ no │ Storage, display │
│ NFD │ yes │ no │ Accent removal, analysis │
│ NFKC │ no │ yes │ Search │
│ NFKD │ yes │ yes │ Text simplification │
└──────┴─────────────┴───────────────┴──────────────────────────┘
NFC = Normalization Form Canonical Composition
NFD = Normalization Form Canonical Decomposition
NFKC = Normalization Form Compatibility Composition
NFKD = Normalization Form Compatibility Decomposition
[1] https://en.wikipedia.org/wiki/Unicode_equivalence
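As a rough illustration of the four forms in the table, the Python sketch below normalizes a precomposed "é" and the "fi" ligature, then applies NFKC to a few characters from this report. It relies only on the standard `unicodedata` module; `codepoints` and `all_forms` are hypothetical helper names.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def codepoints(text):
    """Render text as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


def all_forms(text):
    """Normalize text with each of the four forms."""
    return {form: codepoints(unicodedata.normalize(form, text))
            for form in FORMS}


print(all_forms("\u00e9"))  # precomposed e with acute
print(all_forms("\ufb01"))  # "fi" ligature

# No-break space, non-breaking hyphen, en dash, left double quote under NFKC.
for ch in ("\u00a0", "\u2011", "\u2013", "\u201c"):
    print(codepoints(ch), "->",
          codepoints(unicodedata.normalize("NFKC", ch)))
```

Under NFKC the no-break space becomes U+0020, but the non-breaking hyphen becomes U+2010 (Hyphen) rather than U+002D, and the en dash and curly quote are left unchanged. This suggests that NFKC alone would not provide the requested behavior and that a separate character class mapping is still needed.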
SPACES (U+0020 )
Code Char Name
U+00A0 No-Break Space (NBSP)
U+1680 Ogham Space Mark
U+2000 En Quad
U+2001 Em Quad
U+2002 En Space
U+2003 Em Space
U+2004 Three-Per-Em Space
U+2005 Four-Per-Em Space
U+2006 Six-Per-Em Space
U+2007 Figure Space
U+2008 Punctuation Space
U+2009 Thin Space
U+200A Hair Space
U+202F Narrow No-Break Space
U+205F Medium Mathematical Space
U+3000 Ideographic Space (CJK full-width space)
U+200B Zero-Width Space (invisible)
U+200C Zero-Width Non-Joiner (invisible)
U+200D Zero-Width Joiner (invisible)
U+2060 Word Joiner (invisible)
U+FEFF Zero-Width No-Break Space (BOM)
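The list above mixes two kinds of characters, which the following Python sketch makes visible by grouping the listed code points by Unicode general category. It uses the standard `unicodedata` module; `SPACE_LIST` and `by_category` are names introduced here for illustration.

```python
import unicodedata

# Code points copied from the list above.
SPACE_LIST = [
    0x00A0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x202F, 0x205F, 0x3000,
    0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
]

# Group the code points by general category (Zs = space separator,
# Cf = format character).
by_category = {}
for cp in SPACE_LIST:
    category = unicodedata.category(chr(cp))
    by_category.setdefault(category, []).append(f"U+{cp:04X}")

for category, members in sorted(by_category.items()):
    print(category, len(members), " ".join(members))
```

The first sixteen entries are space separators (Zs), while the five invisible entries (U+200B, U+200C, U+200D, U+2060, U+FEFF) are format characters (Cf). One possible design choice, offered here only as a suggestion, is to fold the Zs group to U+0020 and to ignore the Cf group during matching rather than treat it as a space.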
HYPHEN/DASH VARIANTS (U+002D -)
Code Char Name
U+002D - Hyphen-Minus (ASCII standard)
U+1806 ᠆ Mongolian Todo Soft Hyphen
U+2010 ‐ Hyphen
U+2011 ‑ Non-Breaking Hyphen
U+2012 ‒ Figure Dash
U+2013 – En Dash
U+2014 — Em Dash
U+FE58 ﹘ Small Em Dash
U+FE63 ﹣ Small Hyphen-Minus
U+FF0D － Fullwidth Hyphen-Minus
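Instead of maintaining the hyphen list by hand, it could be derived from the Unicode general category Pd (dash punctuation), which is what reference [2] enumerates. The Python sketch below compares the list above against that category for the Basic Multilingual Plane; `LISTED`, `dash_punctuation`, `not_pd` and `unlisted` are hypothetical names.

```python
import unicodedata

# Code points copied from the list above.
LISTED = [0x002D, 0x1806, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014,
          0xFE58, 0xFE63, 0xFF0D]


def dash_punctuation(limit=0x10000):
    """Return all code points below limit with general category Pd."""
    return [cp for cp in range(limit)
            if unicodedata.category(chr(cp)) == "Pd"]


all_pd = dash_punctuation()

# Listed characters that are not dash punctuation (expected to be empty).
not_pd = [f"U+{cp:04X}" for cp in LISTED if cp not in all_pd]

# Dash punctuation that the list above does not mention.
unlisted = [f"U+{cp:04X}" for cp in all_pd if cp not in LISTED]

print("listed but not Pd:", not_pd)
print("Pd but not listed:", " ".join(unlisted))
```

The exact contents of `unlisted` depend on the Unicode version shipped with the Python interpreter, but it includes, for example, U+2015 (Horizontal Bar). Note that U+2212 (Minus Sign, category Sm) and U+00AD (Soft Hyphen, category Cf) are not in Pd, so they would need an explicit decision.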
SINGLE QUOTES (U+0027 ')
Code Char Name
U+2018 ‘ Left Single Quotation Mark
U+2019 ’ Right Single Quotation Mark
U+201A ‚ Single Low-9 Quotation Mark
U+201B ‛ Single High-Reversed-9 Quotation Mark
U+2039 ‹ Single Left-Pointing Angle Quotation Mark
U+203A › Single Right-Pointing Angle Quotation Mark
U+275B ❛ Heavy Single Turned Comma Quotation Mark Ornament
U+275C ❜ Heavy Single Comma Quotation Mark Ornament
U+276E ❮ Heavy Left-Pointing Angle Quotation Mark Ornament
U+276F ❯ Heavy Right-Pointing Angle Quotation Mark Ornament
U+FF07 ＇ Fullwidth Apostrophe
U+300C 「 Left Corner Bracket (Chinese, Japanese, Korean)
U+300D 」 Right Corner Bracket
DOUBLE QUOTES (U+0022 ")
Code Char Name
U+00AB « Left-Pointing Double Angle Quotation Mark
U+00BB » Right-Pointing Double Angle Quotation Mark
U+201C “ Left Double Quotation Mark
U+201D ” Right Double Quotation Mark
U+201E „ Double Low-9 Quotation Mark
U+201F ‟ Double High-Reversed-9 Quotation Mark
U+275D ❝ Heavy Double Turned Comma Quotation Mark Ornament
U+275E ❞ Heavy Double Comma Quotation Mark Ornament
U+2E42 ⹂ Double Low-Reversed-9 Quotation Mark
U+301D 〝 Reversed Double Prime Quotation Mark
U+301E 〞 Double Prime Quotation Mark
U+301F 〟 Low Double Prime Quotation Mark
U+FF02 ＂ Fullwidth Quotation Mark
U+300E 『 Left White Corner Bracket
U+300F 』 Right White Corner Bracket
APOSTROPHES (U+0027 ')
Code Char Name
U+0027 ' Apostrophe (standard)
U+02BC ʼ Modifier Letter Apostrophe
U+02BB ʻ Modifier Letter Turned Comma
U+02BD ʽ Modifier Letter Reversed Comma
U+02C8 ˈ Modifier Letter Vertical Line (stress mark)
U+055A ՚ Armenian Apostrophe
U+2032 ′ Prime (sometimes misused as apostrophe)
PRIMES (mathematical/scientific notation)
Code Char Name
U+2032 ′ Prime (minutes, feet, derivatives)
U+2033 ″ Double Prime (seconds, inches)
U+2034 ‴ Triple Prime
U+2035 ‵ Reversed Prime
U+2036 ‶ Reversed Double Prime
U+2037 ‷ Reversed Triple Prime
U+2057 ⁗ Quadruple Prime
U+02B9 ʹ Modifier Letter Prime
U+02BA ʺ Modifier Letter Double Prime
ACCENTS (grave, acute, circumflex)
Code Char Name
U+0060 ` Grave Accent (backtick)
U+00B4 ´ Acute Accent (spacing)
U+005E ^ Circumflex Accent (caret)
U+02C6 ˆ Modifier Letter Circumflex Accent
U+02C7 ˇ Caron (háček)
U+02D8 ˘ Breve
U+02D9 ˙ Dot Above
U+02DA ˚ Ring Above
U+02DC ˜ Small Tilde
U+02DD ˝ Double Acute Accent
U+0300 ̀ Combining Grave Accent
U+0301 ́ Combining Acute Accent
U+0302 ̂ Combining Circumflex Accent
U+0303 ̃ Combining Tilde
U+0304 ̄ Combining Macron
[1] https://en.wikipedia.org/wiki/Unicode_equivalence
[2] https://www.compart.com/en/unicode/category/Pd
[3] https://en.wikipedia.org/wiki/Prime_(symbol)
CODE IMPLEMENTATIONS
[4] Normalize all UTF quotes in Javascript
https://gist.github.com/thanpolas/244d9a13151caf5a12e42208b6111aa6
[5] tehsis/normalize: Normalize a string with utf-8 characters.
https://github.com/tehsis/normalize
RELATED LIBRARIES
[6] VitorLuizC/normalize-text: 📝 Provides a simple functions to normalize
texts, whitespaces, paragraphs & diacritics.
https://github.com/VitorLuizC/normalize-text
[7] icu/icu4c/source/common/unicode/normalizer2.h at main · unicode-org/icu
https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/normalizer2.h
GitHub SEARCH: 'NORMALIZATION'
[8] GitbookIO/normall: Normall: normalize filenames, accents etc ... in JS
https://github.com/GitbookIO/normall
TECHNICAL DOCUMENTATION
[9] normalization · GitHub Topics
https://github.com/topics/normalization?l=python
[10] I18N/CanonicalNormalizationIssues - W3C Wiki
https://www.w3.org/wiki/I18N/CanonicalNormalizationIssues
[11] icu/docs/userguide/transforms/normalization/index.md at main ·
unicode-org/icu
https://github.com/unicode-org/icu/blob/main/docs/userguide/transforms/normalization/index.md
[12] Normalization | ICU Documentation
https://unicode-org.github.io/icu/userguide/transforms/normalization/
[13] Custom Normalization | ICU Documentation
https://unicode-org.github.io/icu/design/normalization/custom.html
[14] ICU 78.1: common/unicode/unorm.h File Reference
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/unorm_8h.html