Yue Yi <[email protected]> writes:

>> What about other Po characters like in
>> https://www.compart.com/en/unicode/category/Po
>> ?
>
> The Po category (Punctuation, other) is a vast collection that goes far
> beyond the daily characters used in Chinese or English. It includes many
> symbols from specialized scripts or historical contexts where the
> spacing convention is effectively "undefined" for a general-purpose
> markup parser.
>
> I believe trying to define a universal spacing rule for every character
> in the Po table might be over-engineering. Maybe the primary goal should
> be to ensure that common CJK delimiters (like 。, ,, !) are treated as
> valid boundaries for emphasis.

Common CJK delimiters are actually covered by (category ?|), e.g.

M-: (category-set-mnemonics (char-category-set ?。)) RET

?| should also cover all other languages that do not use spaces (if it
does not, it is a bug in Emacs).

>> Maybe "Terminal Punctuation" property.
>
> Terminal Punctuation is indeed more promising. If we use it as a
> baseline and then cherry-pick a specific subset (or exclude a few
> problematic ones) to act as valid boundaries, the workload should be
> quite manageable.

See the attached updated patch.
I modified the left boundary regexp to exclude Po characters with the
Terminal_Punctuation Unicode property
(see https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt).

CJK should still be fine, I think.
That said, I tried

冰淇凌*。 (Hello *world* foo.

and with the new patch "*。 (Hello *world*" is bold
(perfectly reasonable given the rules, but it looks strange to my eyes).

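For anyone who wants to check how a given character is classified, here
is a minimal sketch that can be evaluated in *scratch*. It only uses
standard Emacs functions; the `my/' helper names are hypothetical and
exist nowhere in Org or Emacs. The results depend on the category table
active in the current buffer, so an Org buffer with the patch applied
may report additional categories.

```elisp
;; Sketch only: the `my/' helpers are hypothetical illustration names.

(defun my/char-categories (char)
  "Return the category mnemonics of CHAR as a string."
  (category-set-mnemonics (char-category-set char)))

(defun my/breakable-char-p (char)
  "Return t when CHAR matches (category ?|) in the current category table."
  (and (string-match-p (rx (category ?|)) (char-to-string char)) t))

(defun my/general-category (char)
  "Return the Unicode general category of CHAR as a symbol, e.g. `Po'."
  (get-char-code-property char 'general-category))

;; Print a small report for a few sample characters.
(dolist (char '(?。 ?, ?! ?. ?\())
  (message "%c categories=%s breakable=%s general-category=%s"
           char
           (my/char-categories char)
           (my/breakable-char-p char)
           (my/general-category char)))
```

The report goes to *Messages*. Comparing the output in a Fundamental
mode buffer and in an Org buffer is a quick way to see what the custom
category table changes.
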
From 6fc6da337243f562947734777ef50975857080a1 Mon Sep 17 00:00:00 2001
Message-ID: <6fc6da337243f562947734777ef50975857080a1.1766743635.git.yanta...@posteo.net>
From: Ihor Radchenko <[email protected]>
Date: Sat, 20 Dec 2025 11:58:16 +0100
Subject: [PATCH v3] WIP: Org markup: Allow Unicode punctuation and breakable
 symbols around emphasis

* lisp/org-element.el (org-element-category-table): Define custom
category table adding opening/closing punctuation, opening/closing
quotes, dashes, and auxiliary punctuation.
(org-element--parse-generic-emphasis): Extend allowed characters
around emphasis to generic opening/closing punctuation, quote
punctuation, dash-likes, and auxiliary ,-like punctuation. Also,
allow breakable characters, like Chinese/Japanese symbols for
languages that do not use spaces. Make sure that we preserve boundary
asymmetry for :;,!? and similar.
(terminal-punctuation): Helper rx construct matching characters that
are terminal punctuation that is usually followed by space in
languages that use spaces (?,!.: and similar).
* lisp/org.el (org-element-emphasis-pre-re):
(org-element-emphasis-post-re): Add new regexps defining emphasis
boundaries.
* lisp/org.el (org-mode): Setup category table.
(org-emphasis-regexp-components): Allow pre/post to be nil to follow
the new defaults. Change the default values of pre/post to nil.
(org-set-emph-re):
(org-do-emphasis-faces):
(org-emphasize): Fall back to parser defaults when pre/post in
`org-emphasis-regexp-components' is nil.

Link: https://list.orgmode.org/87ecoi1jug.fsf@localhost/T/#t
---
 lisp/org-element.el | 78 +++++++++++++++++++++++++++++++++--
 lisp/org.el         | 99 +++++++++++++++++++++++++++++++--------------
 2 files changed, 144 insertions(+), 33 deletions(-)

diff --git a/lisp/org-element.el b/lisp/org-element.el
index 6abccd001..14b58324f 100644
--- a/lisp/org-element.el
+++ b/lisp/org-element.el
@@ -3318,6 +3318,79 @@ ;;; Objects
 
 ;;;; Bold
 
+(rx-define terminal-punctuation
+  ;; See https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
+  (any ?\x0021 ?\x002C ?\x002E (?\x003A . ?\x003B) ?\x003F ?\x037E
+       ?\x0387 ?\x0589 ?\x05C3 ?\x060C ?\x061B (?\x061D . ?\x061F) ?\x06D4
+       (?\x0700 . ?\x070A) ?\x070C (?\x07F8 . ?\x07F9) (?\x0830 . ?\x0835)
+       (?\x0837 . ?\x083E) ?\x085E (?\x0964 . ?\x0965) (?\x0E5A . ?\x0E5B)
+       ?\x0F08 (?\x0F0D . ?\x0F12) (?\x104A . ?\x104B) (?\x1361 . ?\x1368)
+       ?\x166E (?\x16EB . ?\x16ED) (?\x1735 . ?\x1736) (?\x17D4 . ?\x17D6)
+       ?\x17DA (?\x1802 . ?\x1805) (?\x1808 . ?\x1809) (?\x1944 . ?\x1945)
+       (?\x1AA8 . ?\x1AAB) (?\x1B4E . ?\x1B4F) (?\x1B5A . ?\x1B5B) (?\x1B5D . ?\x1B5F)
+       (?\x1B7D . ?\x1B7F) (?\x1C3B . ?\x1C3F) (?\x1C7E . ?\x1C7F)
+       ?\x2024 (?\x203C . ?\x203D) (?\x2047 . ?\x2049) (?\x2CF9 . ?\x2CFB)
+       ?\x2E2E ?\x2E3C ?\x2E41 ?\x2E4C (?\x2E4E . ?\x2E4F)
+       (?\x2E53 . ?\x2E54) (?\x3001 . ?\x3002) (?\xA4FE . ?\xA4FF) (?\xA60D . ?\xA60F)
+       (?\xA6F3 . ?\xA6F7) (?\xA876 . ?\xA877) (?\xA8CE . ?\xA8CF) ?\xA92F
+       (?\xA9C7 . ?\xA9C9) (?\xAA5D . ?\xAA5F) ?\xAADF (?\xAAF0 . ?\xAAF1) ?\xABEB
+       ?\xFE12 (?\xFE15 . ?\xFE16) (?\xFE50 . ?\xFE52) (?\xFE54 . ?\xFE57) ?\xFF01
+       ?\xFF0C ?\xFF0E (?\xFF1A . ?\xFF1B) ?\xFF1F ?\xFF61 ?\xFF64 ?\x1039F ?\x103D0
+       ?\x10857 ?\x1091F (?\x10A56 . ?\x10A57) (?\x10AF0 . ?\x10AF5)
+       (?\x10B3A . ?\x10B3F) (?\x10B99 . ?\x10B9C) (?\x10F55 . ?\x10F59)
+       (?\x10F86 . ?\x10F89) (?\x11047 . ?\x1104D) (?\x110BE . ?\x110C1)
+       (?\x11141 . ?\x11143) (?\x111C5 . ?\x111C6) ?\x111CD (?\x111DE . ?\x111DF)
+       (?\x11238 . ?\x1123C) ?\x112A9 (?\x113D4 . ?\x113D5) (?\x1144B . ?\x1144D)
+       (?\x1145A . ?\x1145B) (?\x115C2 . ?\x115C5) (?\x115C9 . ?\x115D7)
+       (?\x11641 . ?\x11642) (?\x1173C . ?\x1173E) ?\x11944 ?\x11946
+       (?\x11A42 . ?\x11A43) (?\x11A9B . ?\x11A9C) (?\x11AA1 . ?\x11AA2)
+       (?\x11C41 . ?\x11C43) ?\x11C71 (?\x11EF7 . ?\x11EF8)
+       (?\x11F43 . ?\x11F44) (?\x12470 . ?\x12474) (?\x16A6E . ?\x16A6F)
+       ?\x16AF5 (?\x16B37 . ?\x16B39) ?\x16B44 (?\x16D6E . ?\x16D6F)
+       (?\x16E97 . ?\x16E98) ?\x1BC9F (?\x1DA87 . ?\x1DA8A)))
+
+(defvar org-element-category-table
+  (let ((category-table (copy-category-table))
+        (uniprop-table (unicode-property-table-internal 'general-category)))
+    ;; Define categories
+    (define-category ?{ "Opening punctuation" category-table)
+    (define-category ?} "Closing punctuation" category-table)
+    (define-category ?\[ "Initial quote" category-table)
+    (define-category ?\] "Final quote" category-table)
+    (define-category ?- "Dash" category-table)
+    (define-category ?, "Other punctuation" category-table)
+    (define-category ?/ "Other punctuation non-terminal" category-table)
+    ;; Map characters to categories according to their general-category
+    (map-char-table
+     (lambda (key val)
+       (pcase val
+         ('Ps (modify-category-entry key ?{ category-table))
+         ('Pe (modify-category-entry key ?} category-table))
+         ('Pi (modify-category-entry key ?\[ category-table))
+         ('Pf (modify-category-entry key ?\] category-table))
+         ('Pd (modify-category-entry key ?- category-table))
+         ('Po (modify-category-entry key ?, category-table)
+              (mapc
+               (lambda (c)
+                 (unless (string-match-p (rx terminal-punctuation)
+                                         (make-string 1 c))
+                   (modify-category-entry c ?/ category-table)))
+               (if (consp key)
+                   (cl-loop for c from (car key) to (cdr key) collect c)
+                 (list key))))))
+     uniprop-table)
+    category-table)
+  "Category table for Org buffers.
+The table defines additional Unicode categories:
+- ?{ for opening punctuation
+- ?} for closing punctuation
+- ?[ for opening quote
+- ?] for closing quote
+- ?- for dash-like
+- ?, for other punctuation.
+- ?/ for other punctuation that is non-terminal (not .,!? and similar).
+These categories are necessary for parsing emphasis.")
+
 (defun org-element--parse-generic-emphasis (mark type)
   "Parse emphasis object at point, if any.
 
@@ -3331,7 +3404,7 @@ (defun org-element--parse-generic-emphasis (mark type)
       (unless (bolp) (forward-char -1))
       (let ((opening-re
              (rx-to-string
-              `(seq (or line-start (any space ?- ?\( ?' ?\" ?\{))
+              `(seq (regexp ,org-element-emphasis-pre-re)
                     ,mark
                     (not space)))))
         (when (looking-at-p opening-re)
@@ -3341,8 +3414,7 @@ (defun org-element--parse-generic-emphasis (mark type)
                   `(seq
                     (not space)
                     (group ,mark)
-                    (or (any space ?- ?. ?, ?\; ?: ?! ?? ?' ?\" ?\) ?\} ?\\ ?\[)
-                        line-end)))))
+                    (regexp ,org-element-emphasis-post-re)))))
             (when (re-search-forward closing-re nil t)
               (let ((closing (match-end 1)))
                 (goto-char closing)
diff --git a/lisp/org.el b/lisp/org.el
index 67d9679fe..156abc621 100644
--- a/lisp/org.el
+++ b/lisp/org.el
@@ -3823,6 +3823,37 @@ (defcustom org-pretty-entities-include-sub-superscripts t
   :version "24.1"
   :type 'boolean)
 
+(defconst org-element-emphasis-pre-re
+  (rx (or line-start space
+          ;; opening punctuation
+          (category ?{) (category ?\[)
+          ;; dashes
+          (category ?-)
+          ;; other punctuation, except Terminal_Punctuation=True
+          ;; Terminal punctuation is .,!? and similar. In languages
+          ;; with spaces, we do not want such punctuation to be
+          ;; a valid boundary. Languages without spaces will be
+          ;; included within ?| category.
+          (category ?/)
+          ;; Chinese, Japanese, and other breakable
+          ;; characters
+          (category ?|)))
+  "Regexp matching character before opening emphasis marker.
+Assumes that `org-element-category-table' is active.")
+
+(defconst org-element-emphasis-post-re
+  (rx (or space
+          ;; closing punctuation
+          (category ?}) (category ?\])
+          ;; dashes, other punctuation
+          (category ?-) (category ?,)
+          ;; Chinese, Japanese, and other breakable
+          ;; characters
+          (category ?|)
+          line-end))
+  "Regexp matching character after closing emphasis marker.
+Assumes that `org-element-category-table' is active.")
+
 (defvar org-emph-re nil
   "Regular expression for matching emphasis.
 After a match, the match groups contain these elements:
@@ -3850,10 +3881,13 @@ (defun org-set-emph-re (var val)
              (body (if (<= nl 0) body
                      (format "%s*?\\(?:\n%s*?\\)\\{0,%d\\}" body body nl)))
              (template
-              (format (concat "\\([%s]\\|^\\)" ;before markers
+              ;; See `org-element--parse-generic-emphasis'
+              (format (concat "\\(%s\\)" ;before markers
                               "\\(\\([%%s]\\)\\([^%s]\\|[^%s]%s[^%s]\\)\\3\\)"
-                              "\\([%s]\\|$\\)") ;after markers
-                      pre border border body border post)))
+                              "\\(%s\\)") ;after markers
+                      (if pre (format "[%s]\\|^" pre) org-element-emphasis-pre-re)
+                      border border body border
+                      (if post (format "[%s]\\|$" post) org-element-emphasis-post-re))))
         (setq org-emph-re (format template "*/_+"))
         (setq org-verbatim-re (format template "=~")))))
 
@@ -3861,7 +3895,7 @@ (defun org-set-emph-re (var val)
 ;; set this option proved cumbersome. See this message/thread:
 ;; https://orgmode.org/list/[email protected]
 (defvar org-emphasis-regexp-components
-  '("-[:space:]('\"{" "-[:space:].,:!?;'\")}\\[" "[:space:]" "." 1)
+  '(nil nil "[:space:]" "." 1)
   "Components used to build the regular expression for FONTIFYING emphasis.
 WARNING: This variable only affects visual fontification, but does not
 change Org markup. For example, it does not affect how emphasis markup
@@ -3874,7 +3908,9 @@ (defvar org-emphasis-regexp-components
 specify what is allowed/forbidden in each part:
 
 pre          Chars allowed as prematch. Beginning of line will be allowed too.
+             nil means use parser defaults.
 post         Chars allowed as postmatch. End of line will be allowed too.
+             nil means use parser defaults.
 border       The chars *forbidden* as border characters.
 body-regexp  A regexp like \".\" to match a body character. Don't use
              non-shy groups here, and don't allow newline here.
@@ -5106,6 +5142,9 @@ (define-derived-mode org-mode outline-mode "Org"
     (org-install-agenda-files-menu))
   (setq-local outline-regexp org-outline-regexp)
   (setq-local outline-level 'org-outline-level)
+  (require 'org-element)
+  (defvar org-element-category-table) ; org-element.el
+  (set-category-table org-element-category-table)
   ;; Initialize cache.
   (org-element-cache-reset)
   (when (and org-element-cache-persistent
@@ -5385,35 +5424,27 @@ (defsubst org-rear-nonsticky-at (pos)
 
 (defun org-do-emphasis-faces (limit)
   "Run through the buffer and emphasize strings."
-  (let ((quick-re (format "\\([%s]\\|^\\)\\([~=*/_+]\\)"
-                          (car org-emphasis-regexp-components))))
+  (let ((quick-re (format "\\(%s\\)\\([~=*/_+]\\)"
+                          (if (car org-emphasis-regexp-components)
+                              (format "[%s]\\|^" (car org-emphasis-regexp-components))
+                            org-element-emphasis-pre-re))))
     (catch :exit
       (while (re-search-forward quick-re limit t)
         (let* ((marker (match-string 2))
-               (verbatim? (member marker '("~" "="))))
+               (verbatim? (member marker '("~" "=")))
+               (context (save-match-data
+                          (save-excursion
+                            (goto-char (match-beginning 2))
+                            (org-element-context)))))
           (when (save-excursion
                   (goto-char (match-beginning 0))
                   (and
-                   ;; Do not match table hlines.
-                   (not (and (equal marker "+")
-                             (org-match-line
-                              "[ \t]*\\(|[-+]+|?\\|\\+[-+]+\\+\\)[ \t]*$")))
-                   ;; Do not match headline stars. Do not consider
-                   ;; stars of a headline as closing marker for bold
-                   ;; markup either.
-                   (not (and (equal marker "*")
-                             (save-excursion
-                               (forward-char)
-                               (skip-chars-backward "*")
-                               (looking-at-p org-outline-regexp-bol))))
+                   (org-element-type-p
+                    context
+                    '(bold code italic strike-through underline verbatim))
+                   (equal (match-beginning 2) (org-element-begin context))
                    ;; Match full emphasis markup regexp.
-                   (looking-at (if verbatim? org-verbatim-re org-emph-re))
-                   ;; Do not span over paragraph boundaries.
-                   (not (string-match-p org-element-paragraph-separate
-                                        (match-string 2)))
-                   ;; Do not span over cells in table rows.
-                   (not (and (save-match-data (org-match-line "[ \t]*|"))
-                             (string-match-p "|" (match-string 4))))))
+                   (looking-at (if verbatim? org-verbatim-re org-emph-re))))
             (pcase-let ((`(,_ ,face ,_) (assoc marker org-emphasis-alist))
                         (m (if org-hide-emphasis-markers 4 2)))
               (font-lock-prepend-text-property
@@ -5478,12 +5509,20 @@ (defun org-emphasize (&optional char)
     (setq string (concat s string s))
     (when beg (delete-region beg end))
     (unless (or (bolp)
-                (string-match (concat "[" (nth 0 erc) "\n]")
-                              (char-to-string (char-before (point)))))
+                (string-match
+                 (if (nth 0 erc) (concat "[" (nth 0 erc) "\n]")
+                   ;; See `org-element--parse-generic-emphasis'
+                   (rx (or (regexp org-element-emphasis-pre-re)
+                           "\n")))
+                 (char-to-string (char-before (point)))))
       (insert " "))
     (unless (or (eobp)
-                (string-match (concat "[" (nth 1 erc) "\n]")
-                              (char-to-string (char-after (point)))))
+                (string-match
+                 ;; See `org-element--parse-generic-emphasis'
+                 (if (nth 1 erc) (concat "[" (nth 1 erc) "\n]")
+                   (rx (or (regexp org-element-emphasis-post-re)
+                           "\n")))
+                 (char-to-string (char-after (point)))))
       (insert " ") (backward-char 1))
     (insert string)
     (and move (backward-char 1))))
-- 
2.52.0

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
