Re: Non-English (Chinese/Japanese) period and comma to be treated as punctuation characters

Lex Trotman Wed, 07 Nov 2012 18:17:38 -0800

On 8 November 2012 09:58, Jeremy <[email protected]> wrote:

> Hi Tong,
>
> Thanks for your quick response.
>
> You can find the full discussion here,
>
>   https://groups.google.com/group/asciidoc/browse_thread/thread/f7bf7cb23d6253d3
>
> Attached please find the archive containing asciidoc.py (asciidoc.py.bak)
> and asciidoc.conf (asciidoc.conf.bak). The *.bak files come from AsciiDoc
> 8.6.8, and their counterparts (without the .bak extension) are the
> modified ones.
>
> You can try the following content with both the unpatched and patched
> versions to see the difference and what I am trying to achieve.
>
>   舉例來說，`-a stylesheet=newsletter.css` 的設定會採用樣式表 `newsletter.css`。
>
> What I have done is remove the Unicode flag while parsing inline
> literals. I would like to know whether this change could cause other
> unexpected errors.
>


Hi,

Unfortunately, probably yes.

The problem isn't that the regular expression uses Unicode properties (it
should). The problem is that the strings being searched are not
consistently Unicode strings, and this affects the behaviour of the regular
expression engine.  See this example:

# -*- coding: utf-8 -*-
# Python 2: the same pattern with and without the Unicode flag, applied to
# a byte string (a) and to a Unicode string (u).
import re, unicodedata
pat_u = r"(?su)(?<![`\w])([\\]?`(?P<passtext>[^`\s]|[^`\s].*?\S)`)(?![`\w])"
pat = r"(?s)(?<![`\w])([\\]?`(?P<passtext>[^`\s]|[^`\s].*?\S)`)(?![`\w])"
a = "舉例來說，`-a stylesheet=newsletter.css` 的設定會採用樣式表 `newsletter.css`。"
u = u"舉例來說，`-a stylesheet=newsletter.css` 的設定會採用樣式表 `newsletter.css`。"
print re.findall(pat_u, a)   # Unicode flag, byte string
print re.findall(pat, a)     # no Unicode flag, byte string
print re.findall(pat_u, u)   # Unicode flag, Unicode string
print re.findall(pat, u)     # no Unicode flag, Unicode string
print unicodedata.category(u"。")

gives

[('`-a stylesheet=newsletter.css`', '-a stylesheet=newsletter.css')]
[('`-a stylesheet=newsletter.css`', '-a stylesheet=newsletter.css'),
('`newsletter.css`', 'newsletter.css')]
[(u'`-a stylesheet=newsletter.css`', u'-a stylesheet=newsletter.css'),
(u'`newsletter.css`', u'newsletter.css')]
[(u'`-a stylesheet=newsletter.css`', u'-a stylesheet=newsletter.css'),
(u'`newsletter.css`', u'newsletter.css')]
Po

Note that the first call, which combines the Unicode flag with a
non-Unicode (byte) string, gets it wrong: it misses `newsletter.css`.  As
the last line of output shows, the 。 character *is* known to be
punctuation (category Po), so that isn't the problem.  The problem is that
if the string isn't a Unicode string the re engine just looks at bytes, not
code points.
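
To illustrate the bytes-versus-code-points point on a current interpreter,
here is a minimal sketch for Python 3 (not the Python 2 of the example
above).  Python 3 rejects the Unicode flag on byte patterns, so the sketch
emulates the old byte-string behaviour by decoding the UTF-8 bytes as
Latin-1, which turns every byte into one code point.  That emulation and the
names PATTERN, as_text and as_bytes are assumptions made for illustration,
not anything taken from asciidoc.py.

```python
import re
import unicodedata

# Same inline-literal pattern as in the example above; in Python 3 a str
# pattern is Unicode-aware by default, so no (?u) flag is needed.
PATTERN = r"(?s)(?<![`\w])([\\]?`(?P<passtext>[^`\s]|[^`\s].*?\S)`)(?![`\w])"

text = "舉例來說，`-a stylesheet=newsletter.css` 的設定會採用樣式表 `newsletter.css`。"

# Case 1: a real Unicode string, so the engine sees code points.
as_text = [m.group("passtext") for m in re.finditer(PATTERN, text)]

# Case 2 (emulation, an assumption of this sketch): treat each UTF-8 byte
# as if it were a character, roughly what (?u) did with a Python 2 byte string.
bytes_as_chars = text.encode("utf-8").decode("latin-1")
as_bytes = [m.group("passtext") for m in re.finditer(PATTERN, bytes_as_chars)]

# The first UTF-8 byte of the ideographic full stop, seen on its own.
first_byte = "。".encode("utf-8")[0]

print(as_text)
print(as_bytes)
print(hex(first_byte), unicodedata.category(chr(first_byte)))
print(unicodedata.category("。"))
```

In the emulated case the byte following the closing backtick is 0xE3, which
read as a single character is a lowercase letter, so the (?![`\w]) lookahead
rejects the second literal even though 。 itself is punctuation.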

When Asciidoc was written Python didn't consistently handle Unicode
strings, so Asciidoc didn't.

@Stuart, how old a Python do you want to support?  Can all string handling
be changed to Unicode?  That of course must happen to support Python 3
anyway.
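
One way "all string handling in Unicode" might look is to decode once where
input enters the program and work only on text after that.  The following is
a rough Python 3 sketch of that idea, not a patch for AsciiDoc: to_text and
read_lines are hypothetical helpers, and the UTF-8 default is an assumption
(a document in another encoding would pass its own codec name).

```python
import io


def to_text(data, encoding="utf-8"):
    """Return data as text, decoding it first if it arrived as bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def read_lines(stream, encoding="utf-8"):
    """Read every line from a stream and hand back text lines only."""
    return [to_text(line, encoding) for line in stream]


# Stand-in for a source file opened in binary mode.
raw = io.BytesIO("`newsletter.css`。\n".encode("utf-8"))
lines = read_lines(raw)
print(lines)
```

With the decoding done at the boundary, the regular expressions only ever
see code points, whatever encoding the source file was saved in.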

Cheers
Lex

PS Sorry if the Python above gets munged by the mailer


> Thanks,
> Jeremy
>
> On Thursday, January 28, 2010 10:52:22 PM UTC+8, Tong wrote:
>>
>> On Jan 27, 9:13 am, sardine <[email protected]> wrote:
>>
>> >   舉例來說，`-a stylesheet=newsletter.css` 的設定會採用樣式表 `newsletter.css`。
>> >
>> > "-a stylesheet=newsletter.css" will be interpreted as monospaced
>> > correctly, but "`newsletter.css`。" won't.
>> >
>> > The problem is that the Chinese period (。) and comma (，) are not
>> > treated as punctuation characters. The same problem also arises in
>> > Japanese environments.
>>
>> I don't think it is merely the Chinese punctuation characters' (。，...)
>> problem. I bet that if you remove the space after `-a
>> stylesheet=newsletter.css`, what was handled correctly would not be
>> any more.
>>
>> > Is it possible to define punctuation characters for specific language?
>> > or the rule specified in User Guide will not apply to non-English
>> > environments.
>>
>> For the reason above, I don't think such an approach is feasible.
>>
>> BTW, if we do go with that route, please bear in mind that there are
>> many Chinese punctuation characters, and we need to make sure all of
>> them are included. Here is a list to start with:
>> ·、，；。！？''""【】《》（）......
>>
>> Further, the three different encodings, GB/Big5/UTF8, should all be
>> considered as well.
>>
>> My suggested alternatives:
>>
>> - no change to asciidoc -- just add a space after the closing ` -- I
>> know this is ugly.
>> - change asciidoc so that every Chinese character is considered
>> punctuation for ``.
>>
>>  --
> You received this message because you are subscribed to the Google Groups
> "asciidoc" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/asciidoc/-/eXm4blS1B0UJ.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/asciidoc?hl=en.
>
