https://bugs.documentfoundation.org/show_bug.cgi?id=169103

            Bug ID: 169103
           Summary: Add option to treat similar characters as equivalent
                    (spaces, hyphens, quotes) in Find & Replace across all
                    applications
           Product: LibreOffice
           Version: Inherited From OOo
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: medium
         Component: framework
          Assignee: [email protected]
          Reporter: [email protected]

Currently, Find & Replace does not match a non-breaking hyphen '‑' (U+2011,
inserted with Ctrl+Shift+-) when the search term contains a regular hyphen '-'
(U+002D).

USE CASE: In scientific documents, I replaced all hyphens in a-C abbreviations
with non-breaking hyphens (U+2011) to prevent line breaks. However, this makes
searching difficult — typing a regular hyphen in the Find dialog does not match
non-breaking hyphens.

*a-C (amorphous carbon) 
https://en.wikipedia.org/wiki/Amorphous_carbon
https://www.sciencedirect.com/topics/materials-science/amorphous-carbon

COMPARISON WITH MS WORD:
Microsoft Word treats these characters as equivalent in Find operations:
U+002D  -       Hyphen-Minus (regular)
U+2011  ‑       Non-Breaking Hyphen

When searching for '-', Word automatically finds both variants.


The same problem exists with space variants and quotation mark variants.


ENHANCEMENT:
Add a character normalization option to Find & Replace dialogs.

Proposed checkbox label (choose one):
☑ Treat similar characters as equivalent
☑ Match character variants
☑ Ignore punctuation differences

Note: The label should clearly indicate this is separate from diacritic/accent
handling.


BEHAVIOR WHEN ENABLED:
- All space variants → regular space (U+0020)
- All hyphen/dash variants → hyphen-minus (U+002D)
- All quotation mark variants → standard quotes (U+0022, U+0027)

EXAMPLES:
- Searching for "a-C" matches: a-C, a‑C, a–C, a—C
- Searching for "hello world" matches text with any space variant
- Searching for "test" matches: "test", “test”, «test», 『test』

BEHAVIOR WHEN DISABLED (default):
- Only exact character matches (current behavior)
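
A minimal sketch of the enabled/disabled behavior in Python, assuming a
one-to-one character fold applied to both the search term and the document
text. The names FOLD_TABLE and find_all are hypothetical, the table covers only
a small subset of the variants listed in this report, and none of this reflects
how LibreOffice's search is actually implemented.

```python
# Hypothetical fold table: a small subset of the variants listed in this report.
FOLD_TABLE = str.maketrans({
    "\u00a0": " ",   # no-break space
    "\u2009": " ",   # thin space
    "\u202f": " ",   # narrow no-break space
    "\u2010": "-",   # hyphen
    "\u2011": "-",   # non-breaking hyphen
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00ab": '"',   # left-pointing double angle quotation mark
    "\u00bb": '"',   # right-pointing double angle quotation mark
})


def find_all(text, needle, fold_variants=False):
    """Return the start offsets of every match of needle in text."""
    if fold_variants:
        # Every mapping is one character to one character, so offsets in the
        # folded text are also valid offsets in the original text.
        text = text.translate(FOLD_TABLE)
        needle = needle.translate(FOLD_TABLE)
    hits = []
    start = text.find(needle)
    while start != -1:
        hits.append(start)
        start = text.find(needle, start + 1)
    return hits


doc = "a-C, a\u2011C, a\u2013C"
print(find_all(doc, "a-C"))                      # option off: exact match only
print(find_all(doc, "a-C", fold_variants=True))  # option on: all three variants
```

Because the fold is applied to both sides, typing any variant in the Find field
would also match the others; whether that is desirable is part of the proposal.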



OPEN QUESTION: How should primes and accents be handled?

Primes (′, ″, ‴):
- Used in mathematical/scientific notation: f′(x), 5′ 10″
- Currently often confused with quotation marks
- Recommendation: Keep primes separate to preserve mathematical meaning

Accents (`, ´, ^):
- Only partly covered by Unicode NFKC normalization: ´ (U+00B4) becomes
  U+0020 U+0301, while ` and ^ are left unchanged
- Can be standalone (spacing) or combining characters
- Should these be included in character variant matching?

See: [3] https://en.wikipedia.org/wiki/Prime_(symbol)
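
To check what NFKC actually does with the characters in question, a short
probe using Python's standard unicodedata module may help. The helper name
nfkc_codepoints is made up for this illustration. According to the Unicode
character database, NFKC turns U+00B4 into U+0020 U+0301, U+2033 into two
U+2032, and U+2011 into U+2010 (not U+002D), while `, ^, the prime U+2032 and
the en dash U+2013 are unchanged.

```python
import unicodedata


def nfkc_codepoints(ch):
    """Return the code points of NFKC(ch) as a list of 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", ch)]


# Spacing accents, primes, and two dash variants from this report.
for ch in ("\u0060", "\u005e", "\u00b4", "\u2032", "\u2033", "\u2011", "\u2013"):
    print(f"U+{ord(ch):04X} -> {nfkc_codepoints(ch)}")
```

This suggests that NFKC on its own would not provide the requested matching,
and that an explicit mapping for dashes and quotes would still be needed.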



For discussion, I suggest the following schema to illustrate the scope of the
topic:

Text Canonicalization
 └─ Character Folding
     ├─ Unicode Normalization (NFKC, NFKD)
     ├─ Case Folding
     ├─ Accent/Diacritic Folding
     └─ Character Class Normalization
         ├─ Whitespace normalization
         ├─ Hyphen/dash normalization
         └─ Quote normalization

┌──────┬─────────────┬───────────────┬──────────────────────────┐
│ Form │ Decomposed? │ Compatibility?│         Use case         │
├──────┼─────────────┼───────────────┼──────────────────────────┤
│ NFC  │      no     │       no      │ Storage, display         │
│ NFD  │     yes     │       no      │ Accent removal, analysis │
│ NFKC │      no     │      yes      │ Search                   │
│ NFKD │     yes     │      yes      │ Text simplification      │
└──────┴─────────────┴───────────────┴──────────────────────────┘
NFC   = Normalization Form Canonical Composition
NFD   = Normalization Form Canonical Decomposition
NFKC  = Normalization Form Compatibility Composition
NFKD  = Normalization Form Compatibility Decomposition
        [1] https://en.wikipedia.org/wiki/Unicode_equivalence
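
The four forms in the table can be compared on a small sample with Python's
standard unicodedata module. The sample string and the helper name forms are
chosen only for illustration: a precomposed é (U+00E9) shows the
composed/decomposed axis, and the ﬁ ligature (U+FB01) shows the compatibility
axis.

```python
import unicodedata

# Precomposed e-acute followed by the 'fi' ligature.
sample = "\u00e9\ufb01"


def forms(s):
    """Return s in all four Unicode normalization forms."""
    return {name: unicodedata.normalize(name, s)
            for name in ("NFC", "NFD", "NFKC", "NFKD")}


for name, value in forms(sample).items():
    codepoints = " ".join(f"U+{ord(c):04X}" for c in value)
    print(f"{name:4} len={len(value)}  {codepoints}")
```

Only the compatibility forms (NFKC, NFKD) split the ligature into "f" and "i";
only the decomposed forms (NFD, NFKD) split é into "e" plus U+0301.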



SPACES (U+0020  )
Code    Char    Name
U+00A0          No-Break Space (NBSP)
U+1680          Ogham Space Mark
U+2000          En Quad
U+2001          Em Quad
U+2002          En Space
U+2003          Em Space
U+2004          Three-Per-Em Space
U+2005          Four-Per-Em Space
U+2006          Six-Per-Em Space
U+2007          Figure Space
U+2008          Punctuation Space
U+2009          Thin Space
U+200A          Hair Space
U+202F          Narrow No-Break Space
U+205F          Medium Mathematical Space
U+3000          Ideographic Space (CJK full-width space)
U+200B  ​       Zero-Width Space (invisible)
U+200C  ‌       Zero-Width Non-Joiner (invisible)
U+200D  ‍       Zero-Width Joiner (invisible)
U+2060  ⁠       Word Joiner (invisible)
U+FEFF          Zero-Width No-Break Space (BOM, invisible)
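
The list above mixes two kinds of characters. A quick check with Python's
unicodedata module (the names LISTED_SPACES and split_by_category are
hypothetical) separates real space separators, general category Zs, from the
zero-width entries, which Unicode classifies as format characters (Cf).
Mapping the latter to U+0020 would introduce a space where none is visible, so
an implementation might prefer to ignore them during matching; that is an
assumption to discuss, not part of the original proposal.

```python
import unicodedata

# Code points from the SPACES table, in the order listed there.
LISTED_SPACES = [
    0x00A0, 0x1680, *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000,
    0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
]


def split_by_category(codepoints):
    """Split code points into (space separators, everything else)."""
    spaces, others = [], []
    for cp in codepoints:
        if unicodedata.category(chr(cp)) == "Zs":
            spaces.append(cp)
        else:
            others.append(cp)
    return spaces, others


spaces, others = split_by_category(LISTED_SPACES)
print("Zs:", " ".join(f"U+{cp:04X}" for cp in spaces))
for cp in others:
    print(f"U+{cp:04X} has category {unicodedata.category(chr(cp))}")
```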



HYPHEN/DASH VARIANTS (U+002D -)
Code    Char    Name
U+002D  -       Hyphen-Minus (ASCII standard)
U+1806  ᠆       Mongolian Todo Soft Hyphen
U+2010  ‐       Hyphen
U+2011  ‑       Non-Breaking Hyphen
U+2012  ‒       Figure Dash
U+2013  –       En Dash
U+2014  —       Em Dash
U+FE58  ﹘       Small Em Dash
U+FE63  ﹣       Small Hyphen-Minus
U+FF0D  －       Fullwidth Hyphen-Minus
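
Instead of a hand-maintained list, the dash set could be derived from the
Unicode general category Pd (Dash Punctuation), which is what reference [2]
enumerates. A small sketch with Python's unicodedata module follows; the names
LISTED_DASHES and is_dash are hypothetical. Note that the mathematical minus
sign U+2212 is category Sm, not Pd, so it would need an explicit decision.

```python
import unicodedata

# Code points from the HYPHEN/DASH table above.
LISTED_DASHES = [0x002D, 0x1806, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014,
                 0xFE58, 0xFE63, 0xFF0D]


def is_dash(ch):
    """True if ch belongs to the Unicode category Pd (Dash Punctuation)."""
    return unicodedata.category(ch) == "Pd"


for cp in LISTED_DASHES + [0x2212]:
    print(f"U+{cp:04X} dash={is_dash(chr(cp))}")
```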



SINGLE QUOTES (U+0027 ')
Code    Char    Name
U+2018  ‘       Left Single Quotation Mark
U+2019  ’       Right Single Quotation Mark
U+201A  ‚       Single Low-9 Quotation Mark
U+201B  ‛       Single High-Reversed-9 Quotation Mark
U+2039  ‹       Single Left-Pointing Angle Quotation Mark
U+203A  ›       Single Right-Pointing Angle Quotation Mark
U+275B  ❛       Heavy Single Turned Comma Quotation Mark Ornament
U+275C  ❜       Heavy Single Comma Quotation Mark Ornament
U+276E  ❮       Heavy Left-Pointing Angle Quotation Mark Ornament
U+276F  ❯       Heavy Right-Pointing Angle Quotation Mark Ornament
U+FF07  ＇       Fullwidth Apostrophe
U+300C  「       Left Corner Bracket (Chinese, Japanese, Korean)
U+300D  」       Right Corner Bracket



DOUBLE QUOTES (U+0022 ")
Code    Char    Name
U+00AB  «       Left-Pointing Double Angle Quotation Mark
U+00BB  »       Right-Pointing Double Angle Quotation Mark
U+201C  “       Left Double Quotation Mark
U+201D  ”       Right Double Quotation Mark
U+201E  „       Double Low-9 Quotation Mark
U+201F  ‟       Double High-Reversed-9 Quotation Mark
U+275D  ❝       Heavy Double Turned Comma Quotation Mark Ornament
U+275E  ❞       Heavy Double Comma Quotation Mark Ornament
U+2E42  ⹂       Double Low-Reversed-9 Quotation Mark
U+301D  〝       Reversed Double Prime Quotation Mark
U+301E  〞       Double Prime Quotation Mark
U+301F  〟       Low Double Prime Quotation Mark
U+FF02  ＂       Fullwidth Quotation Mark
U+300E  『       Left White Corner Bracket
U+300F  』       Right White Corner Bracket
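
An alternative to rewriting the document text is to expand the search term
into a regular expression in which each quote becomes a character class. The
sketch below uses Python's re module; DOUBLE_QUOTES, SINGLE_QUOTES and
variant_pattern are hypothetical names, and the classes hold only part of the
tables above. Primes are deliberately left out, in line with the
recommendation to keep them separate.

```python
import re

# Partial classes taken from the quote tables in this report.
DOUBLE_QUOTES = '"\u00ab\u00bb\u201c\u201d\u201e\u201f\u300e\u300f\uff02'
SINGLE_QUOTES = "'\u2018\u2019\u201a\u201b\u2039\u203a\uff07"
CLASSES = [DOUBLE_QUOTES, SINGLE_QUOTES]


def variant_pattern(needle):
    """Compile needle so that each quote matches any member of its class."""
    parts = []
    for ch in needle:
        for members in CLASSES:
            if ch in members:
                parts.append("[" + re.escape(members) + "]")
                break
        else:
            # Not a quote variant: match the character literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = variant_pattern('"test"')
for candidate in ('"test"', "\u201ctest\u201d", "\u00abtest\u00bb",
                  "\u300etest\u300f", "\u2033test\u2033"):
    print(candidate, bool(pattern.fullmatch(candidate)))
```

This approach does not check that opening and closing quotes belong together,
so a mixed pair such as «test” would also match.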



APOSTROPHES (U+0027 ')
Code    Char    Name
U+0027  '       Apostrophe (standard)
U+02BC  ʼ       Modifier Letter Apostrophe
U+02BB  ʻ       Modifier Letter Turned Comma
U+02BD  ʽ       Modifier Letter Reversed Comma
U+02C8  ˈ       Modifier Letter Vertical Line (stress mark)
U+055A  ՚       Armenian Apostrophe
U+2032  ′       Prime (sometimes misused as apostrophe)



PRIMES (mathematical/scientific notation)
Code    Char    Name
U+2032  ′       Prime (minutes, feet, derivatives)
U+2033  ″       Double Prime (seconds, inches)
U+2034  ‴       Triple Prime
U+2035  ‵       Reversed Prime
U+2036  ‶       Reversed Double Prime
U+2037  ‷       Reversed Triple Prime
U+2057  ⁗       Quadruple Prime
U+02B9  ʹ       Modifier Letter Prime
U+02BA  ʺ       Modifier Letter Double Prime



ACCENTS (grave, acute, circumflex)
Code    Char    Name
U+0060  `       Grave Accent (backtick)
U+00B4  ´       Acute Accent (spacing)
U+005E  ^       Circumflex Accent (caret)
U+02C6  ˆ       Modifier Letter Circumflex Accent
U+02C7  ˇ       Caron (háček)
U+02D8  ˘       Breve
U+02D9  ˙       Dot Above
U+02DA  ˚       Ring Above
U+02DC  ˜       Small Tilde
U+02DD  ˝       Double Acute Accent
U+0300  ̀        Combining Grave Accent
U+0301  ́        Combining Acute Accent
U+0302  ̂        Combining Circumflex Accent
U+0303  ̃        Combining Tilde
U+0304  ̄        Combining Macron
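
For comparison, conventional diacritic folding can be sketched as canonical
decomposition followed by removal of nonspacing marks (category Mn). The
function name strip_diacritics is hypothetical. The spacing accents at the top
of the table are untouched by this approach, because NFD applies only
canonical decompositions and `, ´ and ^ have none, which supports treating
them as a separate question from diacritic handling.

```python
import unicodedata


def strip_diacritics(s):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("h\u00e1\u010dek"))          # precomposed letters
print(strip_diacritics("a\u0300"))                  # base letter + combining grave
print(strip_diacritics("\u0060\u00b4\u005e") == "\u0060\u00b4\u005e")
```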




[1] https://en.wikipedia.org/wiki/Unicode_equivalence
[2] https://www.compart.com/en/unicode/category/Pd
[3] https://en.wikipedia.org/wiki/Prime_(symbol)


        CODE IMPLEMENTATIONS
[4] Normalize all UTF quotes in Javascript
https://gist.github.com/thanpolas/244d9a13151caf5a12e42208b6111aa6

[5] tehsis/normalize: Normalize a string with utf-8 characters.
https://github.com/tehsis/normalize


        RELATED LIBRARIES
[6] VitorLuizC/normalize-text: 📝 Provides a simple functions to normalize
texts, whitespaces, paragraphs & diacritics.
https://github.com/VitorLuizC/normalize-text

[7] icu/icu4c/source/common/unicode/normalizer2.h at main · unicode-org/icu
https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/normalizer2.h


        GitHub SEARCH: 'NORMALIZATION'
[8] GitbookIO/normall: Normall: normalize filenames, accents etc ... in JS
https://github.com/GitbookIO/normall


        TECHNICAL DOCUMENTATION
[9] normalization · GitHub Topics
https://github.com/topics/normalization?l=python

[10] I18N/CanonicalNormalizationIssues - W3C Wiki
https://www.w3.org/wiki/I18N/CanonicalNormalizationIssues

[11] icu/docs/userguide/transforms/normalization/index.md at main ·
unicode-org/icu
https://github.com/unicode-org/icu/blob/main/docs/userguide/transforms/normalization/index.md

[12] Normalization | ICU Documentation
https://unicode-org.github.io/icu/userguide/transforms/normalization/

[13] Custom Normalization | ICU Documentation
https://unicode-org.github.io/icu/design/normalization/custom.html

[14] ICU 78.1: common/unicode/unorm.h File Reference
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/unorm_8h.html
