Unicode internals (2nd post)

Stuart Rackham Tue, 20 Mar 2012 16:13:54 -0700

Apologies for the first copy going to the wrong discussion (I used reply-to
from Thunderbird and cut and pasted; somehow the original source was
remembered and used by Google to attach it to the original thread).



When AsciiDoc was first released, almost a decade ago, it could only
process ASCII characters. Compatibility with other character sets has
been achieved over time, as problems arose, with a succession of ad hoc
kludges. Instead of reading and decoding all text to Unicode, AsciiDoc
reads the text as binary; where necessary it attempts to decode and
encode internally, but mostly it just passes it through to the output.
Not surprisingly it's all a bit of a mess. In hindsight I'm
astonished that it has worked as well as it has.

Moving forward there will be no Python 3 port until character handling
has been rewritten to process all text inputs internally as Unicode
strings (I was able to port asciidoc to Python 3 fairly trivially,
things only really unraveled with non-ASCII character handling). It's
also high time the text encoding rules were formalized.
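
A minimal sketch of the "decode once on input, process Unicode
internally, encode once on output" approach described above. The function
names (read_source, convert, write_output) are hypothetical and are not
part of AsciiDoc; convert() is only a stand-in for the real processing.

```python
import io

DEFAULT_ENCODING = "utf-8"  # assumed default, matching the proposal


def read_source(stream, encoding=DEFAULT_ENCODING):
    """Decode raw input bytes to a Unicode string once, at the input boundary."""
    return stream.read().decode(encoding)


def convert(text):
    """Stand-in for document processing; it only ever sees Unicode strings."""
    return text.upper()


def write_output(stream, text, encoding=DEFAULT_ENCODING):
    """Encode the Unicode result once, at the output boundary."""
    stream.write(text.encode(encoding))


# Usage: an ISO-8859-1 source is decoded, processed and re-encoded.
source = io.BytesIO("café".encode("iso-8859-1"))
output = io.BytesIO()
write_output(output, convert(read_source(source, "iso-8859-1")), "iso-8859-1")
```

With this shape, no code between read_source() and write_output() needs to
know which encoding the files use.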

I haven't got beyond the planning stage, but here are the proposed
conventions going forward:

. UTF-8 is the default encoding (no change here).
. All configuration (.conf) files to be UTF-8 encoded (afaik all current
  .conf files are UTF-8).
. The AsciiDoc 'encoding' attribute sets the encoding of source
  files and output files (no change here).
. The setting of the 'encoding' attribute in AsciiDoc source documents
  is prohibited (you have to set it on the command-line or from
  configuration files).

In theory at least, the last rule (to avoid a Catch-22) would
introduce a backward incompatibility, because currently the User Guide
states ``The 'encoding' attribute can be set using an AttributeEntry
inside the document header''. But this is broken anyway, in that it
only works for character sets that are backward compatible with ASCII,
e.g. ISO-8859-1 (latin-1).
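
To illustrate the Catch-22, here is a rough sketch that looks for an
':encoding:' entry in the raw, undecoded bytes of a document. The
sniff_encoding() helper and its pattern are assumptions made for this
example, not how AsciiDoc actually parses headers. The entry can be found
when the file's encoding is ASCII compatible, but not when it is something
like UTF-16, because you would need to know the encoding before you could
read the line that declares it.

```python
import re

# Hypothetical pattern for an ':encoding: NAME' entry, matched against raw bytes.
ENCODING_ENTRY = re.compile(rb"^:encoding:\s*(\S+)\s*$", re.MULTILINE)


def sniff_encoding(raw):
    """Return the declared encoding name if it is visible in the raw bytes, else None."""
    match = ENCODING_ENTRY.search(raw)
    return match.group(1).decode("ascii") if match else None


document = "Title\n=====\n:encoding: ISO-8859-1\n\ncafé\n"

# ASCII-compatible encoding: the entry is readable before decoding.
found_latin1 = sniff_encoding(document.encode("iso-8859-1"))

# UTF-16: every character is two bytes, so the ASCII pattern never matches.
found_utf16 = sniff_encoding(document.encode("utf-16"))
```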

Another potential source of problems could be when writing to stdout
instead of a file (afaik stdout and stdin encodings are set in the OS
execution environment and can't be changed by the executing process).
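
For what it's worth, in Python 3 the standard streams expose an underlying
binary buffer (sys.stdout.buffer), so one possible approach is to encode
the text explicitly and write bytes, bypassing the environment's encoding.
The write_text() helper below is a hypothetical sketch under that
assumption; whether it suits AsciiDoc (and what a terminal then does with
the bytes) is a separate question.

```python
import io
import sys


def write_text(text, encoding="utf-8", binary_stream=None):
    """Encode text explicitly and write the bytes to a binary stream.

    Defaults to the binary buffer behind sys.stdout (Python 3).
    """
    if binary_stream is None:
        binary_stream = sys.stdout.buffer
    binary_stream.write(text.encode(encoding))
    binary_stream.flush()


# Usage: capture the bytes in memory instead of writing to the real stdout.
captured = io.BytesIO()
write_text("café\n", encoding="iso-8859-1", binary_stream=captured)
```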

I'm no expert on multi-lingual character sets or Unicode so any
comments, thoughts or suggestions would be welcome.

Cheers, Stuart

