On 08/11/12 15:17, Lex Trotman wrote:
>
>
>
> On 8 November 2012 09:58, Jeremy <[email protected]
> <mailto:[email protected]>> wrote:
>
> Hi Tong,
>
> Thanks for your quick response.
>
> You can find the full discussion here,
>
>
> https://groups.google.com/group/asciidoc/browse_thread/thread/f7bf7cb23d6253d3
>
>     Attached is an archive containing asciidoc.py (asciidoc.py.bak) and
>     asciidoc.conf (asciidoc.conf.bak). The *.bak files come from AsciiDoc
>     8.6.8, and their counterparts (without the .bak extension) are the
>     modified ones.
>
>     If you try the following content with both the unpatched and patched
>     versions, you will see the difference and what I am trying to achieve.
>
> 舉例來說,`-a stylesheet=newsletter.css` 的設定會採用樣式表
> `newsletter.css`。
>
>     What I have done is remove the Unicode flag used when parsing inline
>     literals. I would like to know whether this change incurs other
>     unexpected errors.
>
>
> Hi,
>
> Unfortunately probably.
>
> The problem isn't that the regular expression uses Unicode properties,
> it should, the problem is that the strings being searched are not
> consistently Unicode strings and this has an effect on the behaviour of
> the regular expression engine. See this example
>
> import re, unicodedata
> a = "舉例來說,`-a stylesheet=newsletter.css` 的設定會採用樣式表 `newsletter.css`。"
> u = u"舉例來說,`-a stylesheet=newsletter.css` 的設定會採用樣式表 `newsletter.css`。"
> print re.findall(r"(?su)(?<![`\w])([\\]?`(?P<passtext>[^`\s]|[^`\s].*?\S)`)(?![`\w])", a)
> print re.findall(r"(?s)(?<![`\w])([\\]?`(?P<passtext>[^`\s]|[^`\s].*?\S)`)(?![`\w])", a)
> print re.findall(r"(?su)(?<![`\w])([\\]?`(?P<passtext>[^`\s]|[^`\s].*?\S)`)(?![`\w])", u)
> print re.findall(r"(?s)(?<![`\w])([\\]?`(?P<passtext>[^`\s]|[^`\s].*?\S)`)(?![`\w])", u)
> print unicodedata.category(u"。")
>
> gives
>
> [('`-a stylesheet=newsletter.css`', '-a stylesheet=newsletter.css')]
> [('`-a stylesheet=newsletter.css`', '-a stylesheet=newsletter.css'),
> ('`newsletter.css`', 'newsletter.css')]
> [(u'`-a stylesheet=newsletter.css`', u'-a stylesheet=newsletter.css'),
> (u'`newsletter.css`', u'newsletter.css')]
> [(u'`-a stylesheet=newsletter.css`', u'-a stylesheet=newsletter.css'),
> (u'`newsletter.css`', u'newsletter.css')]
> Po
>
> Note that it gets it wrong with a non-Unicode string. As the last item
> shows, the 。 character *is* known to be punctuation; that isn't the
> problem. The problem is that if the string isn't a Unicode string, the
> re engine just looks at bytes, not code points.
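Under Python 3, where every str is a Unicode string, the same pattern finds both literals without any flag juggling. A quick sketch for comparison, reusing the inline-literal pattern quoted above:

```python
import re

# The inline-literal pattern from the thread; Python 3 str patterns are
# Unicode-aware by default, so no (?u) flag is needed.
PAT = r"(?s)(?<![`\w])([\\]?`(?P<passtext>[^`\s]|[^`\s].*?\S)`)(?![`\w])"

s = ("舉例來說,`-a stylesheet=newsletter.css` 的設定會採用樣式表 "
     "`newsletter.css`。")

# 。 has Unicode category Po (punctuation), not \w, so the trailing
# lookahead succeeds and both inline literals are matched.
matches = [whole for whole, text in re.findall(PAT, s)]
```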
>
> When Asciidoc was written Python didn't consistently handle Unicode
> strings, so Asciidoc didn't.
>
> @Stuart, how old a Python do you want to support? Can all string
> handling be changed to Unicode? That of course must happen to support
> Python 3 anyway.
Hi Lex, I've attached some off-the-cuff notes I made to myself some 6 to
12 months ago re porting to Python 3. I haven't done any coding on it and
don't know how much sense the notes make.
Cheers, Stuart
>
> Cheers
> Lex
>
> PS Sorry if the Python above gets munged by the mailer
>
>
> Thanks,
> Jeremy
>
> On Thursday, January 28, 2010 10:52:22 PM UTC+8, Tong wrote:
>
> On Jan 27, 9:13 am, sardine <[email protected]> wrote:
>
> > 舉例來說,`-a stylesheet=newsletter.css` 的設定會採用樣式表
> `newsletter.css`。
> >
>     > "-a stylesheet=newsletter.css" will be interpreted as monospaced
>     > correctly, but "`newsletter.css`。" won't.
>     >
>     > The problem is that the Chinese period (。) and comma (,) are not
>     > treated as punctuation characters. The same problem also arises in
>     > Japanese environments.
>
>     I don't think this is merely a problem with the Chinese punctuation
>     characters (。,...). I bet that if you remove the space after
>     `-a stylesheet=newsletter.css`, what was handled correctly would no
>     longer be handled correctly.
>
>     > Is it possible to define punctuation characters for a specific
>     > language? Or will the rule specified in the User Guide not apply
>     > to non-English environments?
>
>     For the above reason, I don't think such an approach is feasible.
>
> BTW, if we do go with that route, please bear in mind that there are
> many Chinese punctuation characters, and we need to make sure all of
> them are included. Here is a list to start with:
> ·、,;。!?''""【】《》()......
>
>     Further, the three different encodings, GB, Big5 and UTF-8, should
>     all be considered as well.
>
> My suggested alternatives:
>
> - no change to asciidoc -- just add a space after the closing ` -- I
> know this is ugly.
>     - change asciidoc so that each Chinese character is considered as
>     punctuation for ``.
>
> --
> You received this message because you are subscribed to the Google
> Groups "asciidoc" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/asciidoc/-/eXm4blS1B0UJ.
> To post to this group, send email to [email protected]
> <mailto:[email protected]>.
> To unsubscribe from this group, send email to
> [email protected]
> <mailto:asciidoc%[email protected]>.
> For more options, visit this group at
> http://groups.google.com/group/asciidoc?hl=en.
>
>
= AsciiDoc Python 3 port
Read the 'String Changes in 3.0' section in 'Learning Python 4th Ed.' first.
I haven't got beyond the planning stage, but here are the proposed
conventions going forward:
. UTF-8 is the default encoding (no change here).
. All configuration (.conf) files to be UTF-8 encoded (afaik all
current .conf files are UTF-8).
. The AsciiDoc 'encoding' attribute sets the encoding of source
files and output files (no change here).
. The setting of the 'encoding' attribute in AsciiDoc source documents
is prohibited (you have to set it on the command-line or from
configuration files).
In theory at least, the last rule (to avoid a Catch-22) would
introduce a backward incompatibility because currently the User Guide
states ``The 'encoding' attribute can be set using an AttributeEntry
inside the document header''. But this is broken anyway in that it
only applies to character sets that are backward compatible with ASCII
e.g. ISO-8859-1 (latin-1).
Software should only work with Unicode strings internally, converting
to a particular encoding on output.
Port to Python 3 via 2.6; this is how Django is doing it:
``deprecate older 2.x releases until our minimum requirement is Python
2.6, then to take advantage of the compatibility features in 2.6 to
carry out the actual porting and achieve Python 3 support''
The idea will be to have a Python 2.6 version that can be
automatically converted to Python 3 using `2to3` with a '2to3' AAP
rule.
  2to3 -w -f idioms -f all a2x3.py
  2to3 -w -f idioms -f all -x next asciidoc3.py
Use `sys.version_info >= (3, 0)` to test for Python 3.
Need to replace all open() calls with:
  import codecs

  def file_open(filename, mode='r', encoding=None):
      if not encoding:
          encoding = document.attributes.get('encoding', 'UTF-8')
      return codecs.open(filename, mode, encoding, errors='strict')
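Under Python 3 the built-in open() accepts an encoding argument, so the codecs wrapper becomes unnecessary. A minimal sketch, with a stand-in for asciidoc.py's global document object (an assumption; the real attribute lookup lives on the Document instance):

```python
class _Doc:
    """Stand-in for asciidoc.py's global Document instance (assumption)."""
    attributes = {'encoding': 'UTF-8'}

document = _Doc()

def file_open(filename, mode='r', encoding=None):
    # Python 3 replacement for the codecs.open() wrapper: the built-in
    # open() handles the encoding and error policy directly.
    if not encoding:
        encoding = document.attributes.get('encoding', 'UTF-8')
    return open(filename, mode, encoding=encoding, errors='strict')
```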
. All AsciiDoc distribution text files are UTF-8 encoded.
. The 'encoding' attribute sets the encoding of input and output files
(defaults to UTF-8).
. The use of the 'encoding' attribute in the document header is prohibited
??? unless the encoding of the header is compatible with UTF-8 e.g.
ISO-8859-1 (latin-1)
What exactly is the encoding of text from stdin on Linux and Windows?
See:
- stdout encoding is set by the OS environment and is NOT
  sys.getdefaultencoding(); you can read it with sys.stdout.encoding
  but it can only be set externally (see
  https://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/).
Thankfully on Linux this is normally UTF-8.
Things aren't so simple with Windows
(http://superuser.com/questions/239810/setting-utf8-as-default-character-encoding-in-windows-7).
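One way to sidestep the locale-dependent stdin encoding in Python 3 is to rewrap the underlying binary buffer explicitly. A sketch (the helper name is invented here):

```python
import io

def utf8_reader(binary_stream):
    # Wrap any binary stream (e.g. sys.stdin.buffer) as UTF-8 text,
    # ignoring whatever encoding the OS environment would have chosen.
    return io.TextIOWrapper(binary_stream, encoding='utf-8')
```

Usage with real stdin would be `text = utf8_reader(sys.stdin.buffer).read()`.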
Closing the points of entry:
. Reader to have 'encoding' attribute so includes get the right
encoding.
. Text from asciidoc filter needs to be read with correct encoding.
. Text from `{sys:}` and `{eval:}` et al. needs to be read with the
  correct encoding.
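For `{sys:}` the command output arrives as bytes and has to be decoded with the document encoding rather than the locale default. A sketch (the function name is invented, not asciidoc.py's actual API):

```python
import subprocess

def sys_attribute(command, encoding='UTF-8'):
    # Evaluate a {sys:command} substitution: run the command and decode
    # its stdout with the document encoding, not the locale default.
    out = subprocess.check_output(command, shell=True)
    return out.decode(encoding).rstrip('\n')
```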
Drop the 'newline' attribute -- it's just an unnecessary complication.
Then, by disabling binary file modes and the encode() and decode()
calls, I was able to get asciidoc3.py to work (search for ZZZ's in
asciidoc3.py). But the file read/write code needs to be rewritten per
http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
See:
* http://docs.python.org/dev/howto/pyporting.html
* http://docs.python.org/whatsnew/2.6.html
Define a new intrinsic attribute `{py3}` (set in asciidoc.py at startup)
which would be defined if we're running under a Python 3 interpreter.
Could then be used to synthesise Python 3 filter names in conf
files e.g.
  filter='graphviz2png{py3?3}.py {verbose?-v} -o "{outdir={indir}}/{imagesdir=}{imagesdir?/}{target}" -L {layout=dot} -F {format=png} -'
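Defining the attribute itself would be trivial; a sketch of how `{py3}` might be added during attribute initialization (the function name is hypothetical):

```python
import sys

def intrinsic_attributes():
    # {py3} is defined (empty) only under Python 3, so conditional
    # substitutions like {py3?3} expand to '3' there and to '' on Python 2.
    attrs = {}
    if sys.version_info >= (3, 0):
        attrs['py3'] = ''
    return attrs
```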
* Use the python `-3` option to check code.
* Switch to Git? Google code supports Git, see also:
https://code.google.com/p/support/wiki/ConvertingSvnToGit
https://code.google.com/p/support/wiki/GitFAQ
I ran `2to3 asciidoc.py`; there are a couple of odd things I need
to check out:
See
http://diveintopython3.ep.io/porting-code-to-python-3-with-2to3.html#next
----
- if Lex.next() is not Title:
+ if next(Lex) is not Title:
- if len(self.next) <= self.READ_BUFFER_MIN:
+ if len(self.__next__) <= self.READ_BUFFER_MIN:
----
See
http://diveintopython3.ep.io/porting-code-to-python-3-with-2to3.html#dict
Most of these transforms are unnecessary as they're just being used for
immediate iteration and nothing more.
----
- for k in d.keys():
+ for k in list(d.keys()):
----
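The list() wrapper only matters when the loop mutates the dictionary; a minimal illustration:

```python
d = {'a': 1, 'b': 2, 'c': 3}

# Read-only iteration needs no snapshot -- 2to3's list() here is redundant.
total = sum(d[k] for k in d.keys())

# But deleting during iteration requires one in Python 3, where keys()
# is a live view and mutating the dict mid-iteration raises RuntimeError.
for k in list(d.keys()):
    if d[k] > 1:
        del d[k]
```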
See
1. http://docs.python.org/py3k/library/2to3.html
// This next link is brilliant!
2. http://diveintopython3.ep.io/porting-code-to-python-3-with-2to3.html