Yue Yi <[email protected]> writes:

>> What about other Po characters like in
>> https://www.compart.com/en/unicode/category/Po
>> ?
>
> The Po category (Punctuation, other) is a vast collection that goes far
> beyond the daily characters used in Chinese or English. It includes many
> symbols from specialized scripts or historical contexts where the
> spacing convention is effectively "undefined" for a general-purpose
> markup parser.
>
> I believe trying to define a universal spacing rule for every character
> in the Po table might be over-engineering. Maybe the primary goal should
> be to ensure that common CJK delimiters (like 。, ,, !) are treated as
> valid boundaries for emphasis.

Common CJK delimiters are already covered by (category ?|).
For example: M-: (category-set-mnemonics (char-category-set ?。)) RET
?| should also cover every other language that does not use spaces; if
it does not, that is a bug in Emacs.
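
The same check can be scripted. The following is a minimal sketch, not
part of the patch: `my/char-breakable-p' is a hypothetical helper name,
and it assumes the current buffer uses the standard category table.

```elisp
;; Sketch only: `my/char-breakable-p' is a hypothetical helper, not part
;; of Emacs or Org.
(defun my/char-breakable-p (char)
  "Return non-nil if CHAR belongs to the `|' (line breakable) category.
The lookup uses the category table of the current buffer."
  ;; `char-category-set' returns a bool-vector indexed by category
  ;; mnemonic, so `aref' with ?| tests membership directly.
  (aref (char-category-set char) ?|))

;; Compare an ideographic full stop with an ASCII letter.
(list (my/char-breakable-p ?。)
      (my/char-breakable-p ?a))
```

Evaluating the final form gives a quick way to spot characters that
ought to be in ?| but are not.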

>> Maybe "Terminal Punctuation" property.
>
> Terminal Punctuation is indeed more promising. If we use it as a
> baseline and then cherry-pick a specific subset—or exclude a few
> problematic ones—to act as valid boundaries, the workload should be
> quite manageable.

See the attached updated patch. I modified the left boundary regexp to
exclude Po characters that have the Terminal_Punctuation Unicode property
(see https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt).
CJK should still be fine, I think.
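
To illustrate the rule in isolation, here is a rough sketch that looks
up the general category directly instead of going through a category
table as the patch does. The names `my/terminal-punctuation-sample' and
`my/po-pre-boundary-p' are hypothetical, and the sample list is a small
ASCII-only stand-in for the full Terminal_Punctuation set.

```elisp
;; Sketch only: both names below are hypothetical and exist purely for
;; illustration.  The list is an ASCII-only subset of the characters
;; with Terminal_Punctuation=True, not the full set.
(defconst my/terminal-punctuation-sample '(?! ?, ?. ?: ?\; ??)
  "ASCII-only sample of terminal punctuation characters.")

(defun my/po-pre-boundary-p (char)
  "Return non-nil if CHAR is Po but not sample terminal punctuation.
Such a character would be accepted before an opening emphasis marker
under the rule described above."
  (and (eq (get-char-code-property char 'general-category) 'Po)
       (not (memq char my/terminal-punctuation-sample))))

;; A double quote is Po and non-terminal; a period is Po but terminal;
;; an opening parenthesis is Ps, so this predicate does not cover it.
(list (my/po-pre-boundary-p ?\")
      (my/po-pre-boundary-p ?.)
      (my/po-pre-boundary-p ?\())
```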

That said, I tried
冰淇凌*。 (Hello *world* foo.
With the new patch,
"*。 (Hello *world*" is bold.
(Perfectly reasonable given the rules, but it looks strange to my eyes.)
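
The match can be reproduced outside Org with a simplified sketch. It
assumes the standard category table and keeps only the boundary
alternatives relevant to this example; `my/first-bold-span' is a
hypothetical helper, not the parser's actual code.

```elisp
;; Sketch only: simplified stand-ins for the pre/post boundary regexps.
(defun my/first-bold-span (text)
  "Return the first substring of TEXT that would match as *bold*, or nil."
  (let ((opening (rx (or line-start space (category ?|)) "*" (not space)))
        (closing (rx (not space) (group "*") (or space line-end))))
    (when (string-match opening text)
      ;; The opening regexp ends with "*" plus one non-space character,
      ;; so the marker sits two characters before the end of the match.
      (let ((beg (- (match-end 0) 2)))
        (when (string-match closing text (1+ beg))
          (substring text beg (match-end 1)))))))

(my/first-bold-span "冰淇凌*。 (Hello *world* foo.")
```

The first "*" follows 凌, which is in ?|, and is followed by the
non-space 。, so it opens emphasis. The "*" before "world" follows a
space and cannot close, so the span runs up to the "*" after "world".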

From 6fc6da337243f562947734777ef50975857080a1 Mon Sep 17 00:00:00 2001
Message-ID: <6fc6da337243f562947734777ef50975857080a1.1766743635.git.yanta...@posteo.net>
From: Ihor Radchenko <[email protected]>
Date: Sat, 20 Dec 2025 11:58:16 +0100
Subject: [PATCH v3] WIP: Org markup: Allow Unicode punctuation and breakable
 symbols around emphasis

* lisp/org-element.el (org-element-category-table): Define custom
category table adding opening/closing punctuation, opening/closing
quotes, dashes, and auxiliary punctuation.
(org-element--parse-generic-emphasis): Extend allowed characters
around emphasis to generic opening/closing punctuation, quote
punctuation, dash-likes, and auxiliary ,-like punctuation.  Also,
allow breakable characters, like Chinese/Japanese symbols for
languages that do not use spaces.  Make sure that we preserve
boundary asymmetry for :;,!? and similar.
(terminal-punctuation): Helper rx construct matching characters that
are terminal punctuation that is usually followed by space in
languages that use spaces (?,!.: and similar).
* lisp/org.el (org-element-emphasis-pre-re):
(org-element-emphasis-post-re): Add new regexps defining emphasis
boundaries.
* lisp/org.el (org-mode): Setup category table.
(org-emphasis-regexp-components): Allow pre/post to be nil to follow
the new defaults.  Change the default values of pre/post to nil.
(org-set-emph-re):
(org-do-emphasis-faces):
(org-emphasize): Fall back to parser defaults when pre/post in
`org-emphasis-regexp-components' is nil.

Link: https://list.orgmode.org/87ecoi1jug.fsf@localhost/T/#t
---
 lisp/org-element.el | 78 +++++++++++++++++++++++++++++++++--
 lisp/org.el         | 99 +++++++++++++++++++++++++++++++--------------
 2 files changed, 144 insertions(+), 33 deletions(-)

diff --git a/lisp/org-element.el b/lisp/org-element.el
index 6abccd001..14b58324f 100644
--- a/lisp/org-element.el
+++ b/lisp/org-element.el
@@ -3318,6 +3318,79 @@ ;;; Objects
 
 ;;;; Bold
 
+(rx-define terminal-punctuation
+  ;; See https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
+  (any ?\x0021 ?\x002C ?\x002E (?\x003A . ?\x003B) ?\x003F ?\x037E
+       ?\x0387 ?\x0589 ?\x05C3 ?\x060C ?\x061B (?\x061D . ?\x061F) ?\x06D4
+       (?\x0700 . ?\x070A) ?\x070C (?\x07F8 . ?\x07F9) (?\x0830 . ?\x0835)
+       (?\x0837 . ?\x083E) ?\x085E (?\x0964 . ?\x0965) (?\x0E5A . ?\x0E5B)
+       ?\x0F08 (?\x0F0D . ?\x0F12) (?\x104A . ?\x104B) (?\x1361 . ?\x1368)
+       ?\x166E (?\x16EB . ?\x16ED) (?\x1735 . ?\x1736) (?\x17D4 . ?\x17D6)
+       ?\x17DA (?\x1802 . ?\x1805) (?\x1808 . ?\x1809) (?\x1944 . ?\x1945)
+       (?\x1AA8 . ?\x1AAB) (?\x1B4E . ?\x1B4F) (?\x1B5A . ?\x1B5B) (?\x1B5D . ?\x1B5F)
+       (?\x1B7D . ?\x1B7F) (?\x1C3B . ?\x1C3F) (?\x1C7E . ?\x1C7F)
+       ?\x2024 (?\x203C . ?\x203D) (?\x2047 . ?\x2049) (?\x2CF9 . ?\x2CFB)
+       ?\x2E2E ?\x2E3C ?\x2E41 ?\x2E4C (?\x2E4E . ?\x2E4F)
+       (?\x2E53 . ?\x2E54) (?\x3001 . ?\x3002) (?\xA4FE . ?\xA4FF) (?\xA60D . ?\xA60F)
+       (?\xA6F3 . ?\xA6F7) (?\xA876 . ?\xA877) (?\xA8CE . ?\xA8CF) ?\xA92F
+       (?\xA9C7 . ?\xA9C9) (?\xAA5D . ?\xAA5F) ?\xAADF (?\xAAF0 . ?\xAAF1) ?\xABEB
+       ?\xFE12 (?\xFE15 . ?\xFE16) (?\xFE50 . ?\xFE52) (?\xFE54 . ?\xFE57) ?\xFF01
+       ?\xFF0C ?\xFF0E (?\xFF1A . ?\xFF1B) ?\xFF1F ?\xFF61 ?\xFF64 ?\x1039F ?\x103D0
+       ?\x10857 ?\x1091F (?\x10A56 . ?\x10A57) (?\x10AF0 . ?\x10AF5)
+       (?\x10B3A . ?\x10B3F) (?\x10B99 . ?\x10B9C) (?\x10F55 . ?\x10F59)
+       (?\x10F86 . ?\x10F89) (?\x11047 . ?\x1104D) (?\x110BE . ?\x110C1)
+       (?\x11141 . ?\x11143) (?\x111C5 . ?\x111C6) ?\x111CD (?\x111DE . ?\x111DF)
+       (?\x11238 . ?\x1123C) ?\x112A9 (?\x113D4 . ?\x113D5) (?\x1144B . ?\x1144D)
+       (?\x1145A . ?\x1145B) (?\x115C2 . ?\x115C5) (?\x115C9 . ?\x115D7)
+       (?\x11641 . ?\x11642) (?\x1173C . ?\x1173E) ?\x11944 ?\x11946
+       (?\x11A42 . ?\x11A43) (?\x11A9B . ?\x11A9C) (?\x11AA1 . ?\x11AA2)
+       (?\x11C41 . ?\x11C43) ?\x11C71 (?\x11EF7 . ?\x11EF8)
+       (?\x11F43 . ?\x11F44) (?\x12470 . ?\x12474) (?\x16A6E . ?\x16A6F)
+       ?\x16AF5 (?\x16B37 . ?\x16B39) ?\x16B44 (?\x16D6E . ?\x16D6F)
+       (?\x16E97 . ?\x16E98) ?\x1BC9F (?\x1DA87 . ?\x1DA8A)))
+
+(defvar org-element-category-table
+  (let ((category-table (copy-category-table))
+        (uniprop-table (unicode-property-table-internal 'general-category)))
+    ;; Define categories
+    (define-category ?{ "Opening punctuation" category-table)
+    (define-category ?} "Closing punctuation" category-table)
+    (define-category ?\[ "Initial quote" category-table)
+    (define-category ?\] "Final quote" category-table)
+    (define-category ?- "Dash" category-table)
+    (define-category ?, "Other punctuation" category-table)
+    (define-category ?/ "Other punctuation non-terminal" category-table)
+    ;; Map characters to categories according to their general-category
+    (map-char-table
+     (lambda (key val)
+       (pcase val
+         ('Ps (modify-category-entry key ?{ category-table))
+         ('Pe (modify-category-entry key ?} category-table))
+         ('Pi (modify-category-entry key ?\[ category-table))
+         ('Pf (modify-category-entry key ?\] category-table))
+         ('Pd (modify-category-entry key ?- category-table))
+         ('Po (modify-category-entry key ?, category-table)
+              (mapc
+               (lambda (c)
+                 (unless (string-match-p (rx terminal-punctuation)
+                                         (make-string 1 c))
+                   (modify-category-entry c ?/ category-table)))
+               (if (consp key)
+                   (cl-loop for c from (car key) to (cdr key) collect c)
+                 (list key))))))
+     uniprop-table)
+    category-table)
+  "Category table for Org buffers.
+The table defines additional Unicode categories:
+- ?{ for opening punctuation
+- ?} for closing punctuation
+- ?[ for opening quote
+- ?] for closing quote
+- ?- for dash-like
+- ?, for other punctuation.
+- ?/ for other punctuation that is non-terminal (not .,!? and similar).
+These categories are necessary for parsing emphasis.")
+
 (defun org-element--parse-generic-emphasis (mark type)
   "Parse emphasis object at point, if any.
 
@@ -3331,7 +3404,7 @@ (defun org-element--parse-generic-emphasis (mark type)
       (unless (bolp) (forward-char -1))
       (let ((opening-re
              (rx-to-string
-              `(seq (or line-start (any space ?- ?\( ?' ?\" ?\{))
+              `(seq (regexp ,org-element-emphasis-pre-re)
                     ,mark
                     (not space)))))
         (when (looking-at-p opening-re)
@@ -3341,8 +3414,7 @@ (defun org-element--parse-generic-emphasis (mark type)
                   `(seq
                     (not space)
                     (group ,mark)
-                    (or (any space ?- ?. ?, ?\; ?: ?! ?? ?' ?\" ?\) ?\} ?\\ ?\[)
-                        line-end)))))
+                    (regexp ,org-element-emphasis-post-re)))))
             (when (re-search-forward closing-re nil t)
               (let ((closing (match-end 1)))
                 (goto-char closing)
diff --git a/lisp/org.el b/lisp/org.el
index 67d9679fe..156abc621 100644
--- a/lisp/org.el
+++ b/lisp/org.el
@@ -3823,6 +3823,37 @@ (defcustom org-pretty-entities-include-sub-superscripts t
   :version "24.1"
   :type 'boolean)
 
+(defconst org-element-emphasis-pre-re
+  (rx (or line-start space
+          ;; opening punctuation
+          (category ?{) (category ?\[)
+          ;; dashes
+          (category ?-)
+          ;; other punctuation, except Terminal_Punctuation=True
+          ;; Terminal punctuation is .,!? and similar. In languages
+          ;; with spaces, we do not want such punctuation to be
+          ;; a valid boundary.  Languages without spaces will be
+          ;; included within ?| category.
+          (category ?/)
+          ;; Chinese, Japanese, and other breakable
+          ;; characters
+          (category ?|)))
+  "Regexp matching character before opening emphasis marker.
+Assumes that `org-element-category-table' is active.")
+
+(defconst org-element-emphasis-post-re
+  (rx (or space
+          ;; closing punctuation
+          (category ?}) (category ?\])
+          ;; dashes, other punctuation
+          (category ?-) (category ?,)
+          ;; Chinese, Japanese, and other breakable
+          ;; characters
+          (category ?|)
+          line-end))
+  "Regexp matching character after closing emphasis marker.
+Assumes that `org-element-category-table' is active.")
+
 (defvar org-emph-re nil
   "Regular expression for matching emphasis.
 After a match, the match groups contain these elements:
@@ -3850,10 +3881,13 @@ (defun org-set-emph-re (var val)
 	 (body (if (<= nl 0) body
 		 (format "%s*?\\(?:\n%s*?\\)\\{0,%d\\}" body body nl)))
 	 (template
-	  (format (concat "\\([%s]\\|^\\)" ;before markers
+          ;; See `org-element--parse-generic-emphasis'
+	  (format (concat "\\(%s\\)" ;before markers
 			  "\\(\\([%%s]\\)\\([^%s]\\|[^%s]%s[^%s]\\)\\3\\)"
-			  "\\([%s]\\|$\\)") ;after markers
-		  pre border border body border post)))
+			  "\\(%s\\)") ;after markers
+		  (if pre (format "[%s]\\|^" pre) org-element-emphasis-pre-re)
+                  border border body border
+                  (if post (format "[%s]\\|$" post) org-element-emphasis-post-re))))
       (setq org-emph-re (format template "*/_+"))
       (setq org-verbatim-re (format template "=~")))))
 
@@ -3861,7 +3895,7 @@ (defun org-set-emph-re (var val)
 ;; set this option proved cumbersome.  See this message/thread:
 ;; https://orgmode.org/list/[email protected]
 (defvar org-emphasis-regexp-components
-  '("-[:space:]('\"{" "-[:space:].,:!?;'\")}\\[" "[:space:]" "." 1)
+  '(nil nil "[:space:]" "." 1)
   "Components used to build the regular expression for FONTIFYING emphasis.
 WARNING: This variable only affects visual fontification, but does not
 change Org markup.  For example, it does not affect how emphasis markup
@@ -3874,7 +3908,9 @@ (defvar org-emphasis-regexp-components
 specify what is allowed/forbidden in each part:
 
 pre          Chars allowed as prematch.  Beginning of line will be allowed too.
+             nil means use parser defaults.
 post         Chars allowed as postmatch.  End of line will be allowed too.
+             nil means use parser defaults.
 border       The chars *forbidden* as border characters.
 body-regexp  A regexp like \".\" to match a body character.  Don't use
              non-shy groups here, and don't allow newline here.
@@ -5106,6 +5142,9 @@ (define-derived-mode org-mode outline-mode "Org"
     (org-install-agenda-files-menu))
   (setq-local outline-regexp org-outline-regexp)
   (setq-local outline-level 'org-outline-level)
+  (require 'org-element)
+  (defvar org-element-category-table) ; org-element.el
+  (set-category-table org-element-category-table)
   ;; Initialize cache.
   (org-element-cache-reset)
   (when (and org-element-cache-persistent
@@ -5385,35 +5424,27 @@ (defsubst org-rear-nonsticky-at (pos)
 
 (defun org-do-emphasis-faces (limit)
   "Run through the buffer and emphasize strings."
-  (let ((quick-re (format "\\([%s]\\|^\\)\\([~=*/_+]\\)"
-			  (car org-emphasis-regexp-components))))
+  (let ((quick-re (format "\\(%s\\)\\([~=*/_+]\\)"
+			  (if (car org-emphasis-regexp-components)
+                              (format "[%s]\\|^" (car org-emphasis-regexp-components))
+                            org-element-emphasis-pre-re))))
     (catch :exit
       (while (re-search-forward quick-re limit t)
 	(let* ((marker (match-string 2))
-	       (verbatim? (member marker '("~" "="))))
+	       (verbatim? (member marker '("~" "=")))
+               (context (save-match-data
+                          (save-excursion
+                            (goto-char (match-beginning 2))
+                            (org-element-context)))))
 	  (when (save-excursion
 		  (goto-char (match-beginning 0))
 		  (and
-		   ;; Do not match table hlines.
-		   (not (and (equal marker "+")
-			     (org-match-line
-			      "[ \t]*\\(|[-+]+|?\\|\\+[-+]+\\+\\)[ \t]*$")))
-		   ;; Do not match headline stars.  Do not consider
-		   ;; stars of a headline as closing marker for bold
-		   ;; markup either.
-		   (not (and (equal marker "*")
-			     (save-excursion
-			       (forward-char)
-			       (skip-chars-backward "*")
-			       (looking-at-p org-outline-regexp-bol))))
+                   (org-element-type-p
+                    context
+                    '(bold code italic strike-through underline verbatim))
+                   (equal (match-beginning 2) (org-element-begin context))
 		   ;; Match full emphasis markup regexp.
-		   (looking-at (if verbatim? org-verbatim-re org-emph-re))
-		   ;; Do not span over paragraph boundaries.
-		   (not (string-match-p org-element-paragraph-separate
-					(match-string 2)))
-		   ;; Do not span over cells in table rows.
-		   (not (and (save-match-data (org-match-line "[ \t]*|"))
-			     (string-match-p "|" (match-string 4))))))
+		   (looking-at (if verbatim? org-verbatim-re org-emph-re))))
 	    (pcase-let ((`(,_ ,face ,_) (assoc marker org-emphasis-alist))
 			(m (if org-hide-emphasis-markers 4 2)))
 	      (font-lock-prepend-text-property
@@ -5478,12 +5509,20 @@ (defun org-emphasize (&optional char)
     (setq string (concat s string s))
     (when beg (delete-region beg end))
     (unless (or (bolp)
-		(string-match (concat "[" (nth 0 erc) "\n]")
-			      (char-to-string (char-before (point)))))
+		(string-match
+                 (if (nth 0 erc) (concat "[" (nth 0 erc) "\n]")
+                   ;; See `org-element--parse-generic-emphasis'
+                   (rx (or (regexp org-element-emphasis-pre-re)
+                           "\n")))
+		 (char-to-string (char-before (point)))))
       (insert " "))
     (unless (or (eobp)
-		(string-match (concat "[" (nth 1 erc) "\n]")
-			      (char-to-string (char-after (point)))))
+		(string-match
+                 ;; See `org-element--parse-generic-emphasis'
+                 (if (nth 1 erc) (concat "[" (nth 1 erc) "\n]")
+                   (rx (or (regexp org-element-emphasis-post-re)
+                           "\n")))
+		 (char-to-string (char-after (point)))))
       (insert " ") (backward-char 1))
     (insert string)
     (and move (backward-char 1))))
-- 
2.52.0

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
