Re: Bug: ODT export of Chinese text inserts spaces for line breaks

2022-10-20 Thread Ihor Radchenko
Ihor Radchenko  writes:

> I am attaching the fix that leverages `fill-region' to handle all the
> complexities for us. It is the easiest way and I see no reason to look
> deeper.

Applied onto main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=3502ce2dbb29b70cdbb978d144322d48cb00f26d

-- 
Ihor Radchenko // yantar92,
Org mode contributor



Re: Bug: ODT export of Chinese text inserts spaces for line breaks

2022-10-08 Thread Ihor Radchenko
Maxim Nikulin  writes:

> On 29/06/2021 10:47, James Harkins wrote:
>> * Test
>> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
>> 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
>> 要求办理离校手续,领取相关证书后离校;
>
>> Exporting to ODT produces the following (body text, omitting titles,
>> headers and such).
>> 
>> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 
>> 要求办理离校手续,领取相关证书后离校;
>
> Confirmed: newlines are copied to the ODT document as is and they appear as
> spaces in LibreOffice. I did not try HTML since I am unsure if
> browsers should glue paragraphs with newlines into a continuous string
> without spaces. Maybe it is necessary to add some attributes for proper
> representation (e.g. "lang"), however "#+LANGUAGE: cn" does not help
> even though LibreOffice considers the paragraph as Chinese.

That newlines are rendered as spaces is part of the ODT schema.

> As to splicing lines, I found `fill-delete-newlines' that uses 
> `fill-nospace-between-words-table' besides ?| category to determine 
> whether space should be suppressed while splicing lines. In addition 
> there are some variables to tune behavior.

I am attaching the fix that leverages `fill-region' to handle all the
complexities for us. It is the easiest way and I see no reason to look
deeper.

From 614944ba1ac5502c7648747363674b8d45bfaaf7 Mon Sep 17 00:00:00 2001
Message-Id: <614944ba1ac5502c7648747363674b8d45bfaaf7.1665234699.git.yanta...@gmail.com>
From: Ihor Radchenko 
Date: Sat, 8 Oct 2022 21:08:47 +0800
Subject: [PATCH] ox-odt: Fix newlines replaced by spaces in Han script

* lisp/ox-odt.el (org-odt-plain-text): Use `fill-region' to unfill the
paragraphs with newlines accounting for scripts without spaces between
words.

Reported-by: James Harkins 
Link: https://orgmode.org/list/sbhnlv$4t1$1...@ciao.gmane.io
---
 lisp/ox-odt.el | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/lisp/ox-odt.el b/lisp/ox-odt.el
index 208a39d9d..c989d2014 100644
--- a/lisp/ox-odt.el
+++ b/lisp/ox-odt.el
@@ -2903,9 +2903,20 @@ (defun org-odt-plain-text (text info)
         (setq output
               (replace-regexp-in-string (car pair) (cdr pair) output t nil))))
     ;; Handle break preservation if required.
-    (when (plist-get info :preserve-breaks)
-      (setq output (replace-regexp-in-string
-                    "\\(\\\\\\\\\\)?[ \t]*\n" "<text:line-break/>" output t)))
+    (if (plist-get info :preserve-breaks)
+        (setq output (replace-regexp-in-string
+                      "\\(\\\\\\\\\\)?[ \t]*\n" "<text:line-break/>" output t))
+      ;; OpenDocument schema recognizes newlines as spaces, which may
+      ;; not be desired in scripts that do not separate words with
+      ;; spaces (for example, Han script).  `fill-region' is able to
+      ;; handle such situations.
+      (setq output
+            (with-temp-buffer
+              (insert output)
+              ;; Unfill.
+              (let ((fill-column (point-max)))
+                (fill-region (point-min) (point-max)))
+              (buffer-string))))
     ;; Return value.
     output))
 
-- 
2.35.1

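For reference, the unfill step in the patch above can be tried outside
the exporter. The following is only a minimal sketch, not the committed
code: `my/unfill-string' is a hypothetical helper name, and binding
`fill-column' to `most-positive-fixnum' rather than `(point-max)' is an
assumption made here, so that double-width characters (whose display
width exceeds their character count) cannot cause the text to be
wrapped again.

```elisp
;; Sketch only: `my/unfill-string' is a hypothetical helper, not part of Org.
(defun my/unfill-string (text)
  "Return TEXT with the newlines inside each paragraph removed.
Joining is delegated to `fill-region', which decides for every line
break whether a space is needed between the joined lines."
  (with-temp-buffer
    (insert text)
    ;; Assumption: use a practically unlimited fill column so that the
    ;; joined line is never broken again, whatever its display width.
    (let ((fill-column most-positive-fixnum))
      (fill-region (point-min) (point-max)))
    (buffer-string)))

;; Example calls: a Han line break and a Latin line break.
(my/unfill-string "学位证\n书")
(my/unfill-string "foo\nbar")
```

The intent is that the first call joins the two lines without a space,
while the second keeps a single space between the words.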



Re: Bug: ODT export of Chinese text inserts spaces for line breaks

2021-06-30 Thread Maxim Nikulin

On 29/06/2021 10:47, James Harkins wrote:
> * Test
> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
> 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
> 要求办理离校手续,领取相关证书后离校;
>
> Exporting to ODT produces the following (body text, omitting titles,
> headers and such).
>
> 1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 
> 要求办理离校手续,领取相关证书后离校;

Confirmed: newlines are copied to the ODT document as is and they appear as 
spaces in LibreOffice. I did not try HTML since I am unsure if 
browsers should glue paragraphs with newlines into a continuous string 
without spaces. Maybe it is necessary to add some attributes for proper 
representation (e.g. "lang"), however "#+LANGUAGE: cn" does not help 
even though LibreOffice considers the paragraph as Chinese.


On 30/06/2021 01:19, Eric Abrahamsen wrote:
> There are a few ways to approach this:
>
> (aref char-script-table ?中) -> 'han
> (string-match-p "\\cc" "中") -> 0
> (aref (char-category-set ?中) ?|) -> t


Thank you. I had not noticed all the features hidden behind \c. I believe

(rx (category can-break))

is more readable, and I am a bit surprised that there are no descriptive 
aliases for char categories such as ?|. Just to add another example:


(category-set-mnemonics (char-category-set ?ф)) -> ".LYchjy"

and `describe-categories' to decipher it.

As to splicing lines, I found `fill-delete-newlines' that uses 
`fill-nospace-between-words-table' besides ?| category to determine 
whether space should be suppressed while splicing lines. In addition 
there are some variables to tune behavior.
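
To make the splicing condition described above concrete, here is a
minimal sketch of a predicate combining the ?| category with
`fill-nospace-between-words-table'. It reflects one reading of what
`fill-delete-newlines' checks and is an approximation, not a copy of
that function; `my/nospace-join-p' is a hypothetical name.

```elisp
;; Sketch only: `my/nospace-join-p' is a hypothetical helper.
;; `fill-nospace-between-words-table' is defined in fill.el, which is
;; normally preloaded, so nothing needs to be required here.
(defun my/nospace-join-p (prev next)
  "Return t if lines ending in PREV and starting with NEXT need no space.
PREV is the last character before a line break, NEXT the first one
after it."
  (and (or (aref (char-category-set next) ?|)
           (aref (char-category-set prev) ?|))
       (or (aref fill-nospace-between-words-table next)
           (aref fill-nospace-between-words-table prev))
       t))

;; Example calls: Han, Latin and Cyrillic character pairs.
(my/nospace-join-p ?证 ?书)
(my/nospace-join-p ?c ?d)
(my/nospace-join-p ?б ?в)
```

Only the Han pair is expected to qualify; Latin and Cyrillic words
still need a space when lines are joined.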





Re: Bug: ODT export of Chinese text inserts spaces for line breaks

2021-06-29 Thread Eric Abrahamsen
Maxim Nikulin  writes:

> On 29/06/2021 10:47, James Harkins wrote:
>> So, it would make sense to add a rule to the exporter: if one of the
>> characters before or after a source-text line break is a Chinese,
>> Japanese or Korean character, do not add a space.
>
> On 29/06/2021 11:43, tumashu wrote:
>> You can try the below config :-)
>>     (let ((regexp "[[:multibyte:]]")
>>           (string text))
>>       (setq string
>>             (replace-regexp-in-string
>>              (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
>>              "\\1\\2" string)))
>
> Notice that [[:multibyte:]] means almost any non-ASCII script, e.g.
> Cyrillic:
>
> (let ((sample "abc абв def"))
>   (and (string-match "[[:multibyte:]]\+" sample)
>        (match-string 0 sample)))
> "абв"
>
> It seems `org-fill-paragraph' (M-q) is smart enough to avoid a space
> before or after a CJK character, so it is possible to determine the
> correct way to splice lines, even though e.g. the "Script" Unicode
> property is not exposed to elisp:
> https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
> (Anyway, maintaining an explicit list of scripts is not a straightforward
> approach.)

There are a few ways to approach this:

(aref char-script-table ?中) -> 'han

(string-match-p "\\cc" "中") -> 0

(aref (char-category-set ?中) ?|) -> t



Re: Bug: ODT export of Chinese text inserts spaces for line breaks

2021-06-29 Thread Maxim Nikulin

On 29/06/2021 10:47, James Harkins wrote:
> So, it would make sense to add a rule to the exporter: if one of the
> characters before or after a source-text line break is a Chinese,
> Japanese or Korean character, do not add a space.

On 29/06/2021 11:43, tumashu wrote:
> You can try the below config :-)
>     (let ((regexp "[[:multibyte:]]")
>           (string text))
>       (setq string
>             (replace-regexp-in-string
>              (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
>              "\\1\\2" string)))


Notice that [[:multibyte:]] means almost any non-ASCII script, e.g. 
Cyrillic:


(let ((sample "abc абв def"))
  (and (string-match "[[:multibyte:]]\+" sample)
       (match-string 0 sample)))
"абв"

It seems `org-fill-paragraph' (M-q) is smart enough to avoid a space 
before or after a CJK character, so it is possible to determine the correct 
way to splice lines, even though e.g. the "Script" Unicode property is not 
exposed to elisp: 
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html 
(Anyway, maintaining an explicit list of scripts is not a straightforward 
approach.)


P.S.
JavaScript in browsers allows filtering characters that belong to a 
particular script:


"abc абв def".match(/\p{Script=Cyrillic}+/u)
Array [ "абв" ]

I have not found such a feature in the regular expressions available in Emacs.




Bug: ODT export of Chinese text inserts spaces for line breaks

2021-06-28 Thread James Harkins
Consider the following org document.

* Test
1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证
书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关
要求办理离校手续,领取相关证书后离校;

This was produced by pasting in a single, long line, and then using alt-Q (a 
normal thing to do, and good for readability, because org-mode doesn't wrap 
lines by default).

Exporting to ODT produces the following (body text, omitting titles, headers 
and such).

1本人不想亲自拿到学历学位证书、急于离校者,可书面委托他人代领学历学位证 书,29日起即可离校;2本人想亲自领取学历学位证书者,按学校规定的程序及有关 
要求办理离校手续,领取相关证书后离校;

Between 证 and 书, and between 关 and 要, there is a space. Chinese typography does 
not allow for spaces mid-sentence.

So, it would make sense to add a rule to the exporter: if one of the characters 
before or after a source-text line break is a Chinese, Japanese or Korean 
character, do not add a space. (The space is valid, of course, if the 
characters on either side of the line breaks are Roman or [I would guess] 
Cyrillic as well.)
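
The rule proposed above could be sketched roughly as follows. This is an
illustration under stated assumptions, not exporter code:
`my/cjk-char-p' and `my/join-lines-cjk-aware' are hypothetical names,
the list of scripts treated as CJK is a guess, and the input is assumed
to be a single paragraph.

```elisp
;; Sketch only: both helpers below are hypothetical names.
(defun my/cjk-char-p (char)
  "Return t if CHAR belongs to a script treated here as CJK.
The script list is an assumption made for this sketch."
  (and char
       (memq (aref char-script-table char) '(han kana hangul cjk-misc))
       t))

(defun my/join-lines-cjk-aware (text)
  "Join the lines of TEXT, assumed to be a single paragraph.
A line break next to a CJK character is removed; any other line
break is replaced by a single space."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (while (re-search-forward "[ \t]*\n[ \t]*" nil t)
      ;; Look at the characters on both sides of the line break.
      (let ((prev (char-before (match-beginning 0)))
            (next (char-after (match-end 0))))
        (replace-match
         (if (or (my/cjk-char-p prev) (my/cjk-char-p next)) "" " ")
         t t)))
    (buffer-string)))

;; Example calls: a Han line break and a Latin line break.
(my/join-lines-cjk-aware "学位证\n书")
(my/join-lines-cjk-aware "abc\ndef")
```

Maintaining such a script list by hand is fragile, and treating Korean
this way may be wrong, since Korean text separates words with spaces.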

(Side note: Exporting to a LaTeX buffer shows that the line breaks have been 
copied into the .tex document as is -- but, provided that you have a 
`\usepackage{xeCJK}` in the preamble, LaTeX produces correct, space-free output. 
So -- Org "gets away with it" because of LaTeX's handling of CJK text. It seems 
for ODT, Org needs to handle the spacing within its own logic.)

This is org 9.1.9... bit old, I know, but I'm gonna take a wild guess that this 
has not been a high-visibility issue.

hjh