Re: Unicode internals (2nd post)

Lex Trotman Tue, 20 Mar 2012 17:17:17 -0700

On 21 March 2012 10:13, Stuart Rackham <[email protected]> wrote:
> Apologies for the first copy going to the wrong discussion (I reply-to from
> Thunderbird and cut and pasted, somehow the original source was remembered
> and used by Google to attach it to the original thread).
>
>
> When AsciiDoc was first released, almost a decade ago, it could only
> process ASCII characters. Compatibility with other character sets has
> been achieved over time, as problems arose, with a succession of ad hoc
> kludges. Instead of reading and decoding all text to Unicode, AsciiDoc
> reads the text as binary; where necessary it attempts to decode and
> encode internally, but mostly it just passes it through to the output.
> Not surprisingly it's all a bit of a mess.  In hindsight I'm
> astonished that it has worked as well as it has.
>
> Moving forward there will be no Python 3 port until character handling
> has been rewritten to process all text inputs internally as Unicode
> strings (I was able to port asciidoc to Python 3 fairly trivially,
> things only really unraveled with non-ASCII character handling). It's
> also high time the text encoding rules were formalized.
>
> I haven't got beyond the planning stage, but here are the proposed
> conventions going forward:
>
> . UTF-8 is the default encoding (no change here).
> . All configuration (.conf) files to be UTF-8 encoded (afaik all current
>  .conf files are UTF-8).
> . The AsciiDoc 'encoding' attribute sets the encoding of source
>   files and output files (no change here).
> . The setting of the 'encoding' attribute in AsciiDoc source documents
>  is prohibited (you have to set it on the command-line or from
>  configuration files).
>
>  In theory at least, the last rule (to avoid a Catch-22) would
> introduce a backward incompatibility because currently the User Guide
> states ``The 'encoding' attribute can be set using an AttributeEntry
> inside the document header''. But this is broken anyway in that it
> only applies to character sets that are backward compatible with ASCII
> e.g.  ISO-8859-1 (latin-1).
>
> Another potential source of problems could be when writing to stdout
> instead of a file (afaik stdout and stdin encodings are set in the OS
> execution environment and can't be changed by the executing process).
>
> I'm no expert on multi-lingual character sets or Unicode so any
> comments, thoughts or suggestions would be welcome.


Hi Stuart,

Encodings should die!!! Don't encourage them!!  Oh, ok, that's a bit harsh :)

What we do on another (non-Python) project that seems to work well in
practice is:

1. All internal processing is Unicode (UTF-8 encoded, but that is
implementation defined)

2. If the user specifies an encoding on the command line, use that, and
if the file doesn't convert or fails to validate, complain and exit.

3. If the file has a BOM, try the encoding that implies; if it fails to
convert or validate, keep going.

4. Look at the beginning of the file (3 lines and <512 chars IIRC) and
search for the ASCII regex:

"coding[\t ]*[:=][\t ]*\"?([a-zA-Z0-9-_]+)\"?[\t ]*"

where the match group is a name that convert knows about.  If this
exists and fails to convert and validate, complain and exit.  This
matches most of the encoding marks editors use.  It would usually be in
a comment near the start of the file, although some editors (e.g. your
favourite VIM, IIUC) support one near the end as well, but that involves
reading the whole file as ASCII so we don't do it.

5. Try the system default encoding on the assumption that files are
created using that and so it's a likely candidate.  If it fails to
convert and validate, keep going.

6. Try loading as UTF-8 and, if it fails to validate, complain and exit.
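
Since asciidoc is Python, here is a rough sketch of steps 2 to 6 in
Python 3.  It is only an illustration, not the code from the other
project: decode_source, try_decode, EncodingError and the
system_encoding parameter are made-up names, and treating a coding mark
whose name Python does not recognise as "no mark" is my reading of
step 4.  "Complain and exit" is modelled as raising an exception.

```python
import codecs
import locale
import re

# The regex from step 4 as a bytes pattern; the hyphen is moved to the end
# of the character class so it is unambiguously a literal.
CODING_RE = re.compile(rb'coding[\t ]*[:=][\t ]*"?([a-zA-Z0-9_-]+)"?[\t ]*')

# UTF-32 is listed before UTF-16 because the UTF-32 little-endian BOM
# begins with the UTF-16 little-endian BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


class EncodingError(Exception):
    """Stands in for 'complain and exit' (hypothetical name)."""


def try_decode(data, encoding):
    """Return the decoded text, or None if conversion or validation fails."""
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeError):
        return None


def decode_source(data, user_encoding=None, system_encoding=None):
    """Decode the bytes of a whole file following steps 2 to 6."""
    if user_encoding is not None:                     # step 2: binding
        text = try_decode(data, user_encoding)
        if text is None:
            raise EncodingError("input is not valid " + user_encoding)
        return text

    for bom, name in BOMS:                            # step 3: BOM
        if data.startswith(bom):
            text = try_decode(data, name)
            if text is not None:
                return text
            break                                     # failed: keep going

    head = b"\n".join(data[:512].split(b"\n")[:3])    # step 4: coding mark
    match = CODING_RE.search(head)
    if match:
        name = match.group(1).decode("ascii")
        try:
            codecs.lookup(name)
        except LookupError:
            name = None                               # unknown name: ignore
        if name is not None:
            text = try_decode(data, name)
            if text is None:
                raise EncodingError("input is not valid " + name)
            return text

    if system_encoding is None:                       # step 5: system default
        system_encoding = locale.getpreferredencoding(False)
    text = try_decode(data, system_encoding)
    if text is not None:
        return text

    text = try_decode(data, "utf-8")                  # step 6: UTF-8
    if text is None:
        raise EncodingError("input is not valid UTF-8")
    return text
```

The caller would read the file in binary mode and pass the bytes in, for
example decode_source(open(path, "rb").read(), user_encoding=opt), where
opt is whatever the command line supplied (or None).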

Unless the user specifies otherwise outputs are written in the system encoding.

Note that this requires that the whole file be read to convert and
validate.  That suits us, but may not be appropriate to asciidoc, in
which case the first encoding specification found from 2, 3 or 4
should be used, falling back to 6 (UTF-8).
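
A minimal sketch of that lighter variant in Python 3, assuming it is
acceptable to peek at the first 512 bytes and then stream the file.
pick_encoding and iter_lines are hypothetical names, and a coding mark
naming an encoding Python does not know is simply skipped here.  Nothing
is validated up front, so a bad byte surfaces as a UnicodeDecodeError
while iterating.

```python
import codecs
import io
import re

# The step 4 regex as a bytes pattern (hyphen moved to the end of the class).
CODING_RE = re.compile(rb'coding[\t ]*[:=][\t ]*"?([a-zA-Z0-9_-]+)"?')


def pick_encoding(head, user_encoding=None):
    """Return the first encoding named by steps 2, 3 or 4, else UTF-8."""
    if user_encoding:                                  # step 2
        return user_encoding
    if head.startswith(codecs.BOM_UTF8):               # step 3
        return "utf-8-sig"
    # UTF-32 is checked before UTF-16: their little-endian BOMs share a prefix.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    first_lines = b"\n".join(head[:512].split(b"\n")[:3])
    match = CODING_RE.search(first_lines)              # step 4
    if match:
        try:
            return codecs.lookup(match.group(1).decode("ascii")).name
        except LookupError:
            pass                                       # unknown name: skip it
    return "utf-8"                                     # step 6


def iter_lines(path, user_encoding=None):
    """Yield decoded lines without reading the whole file first."""
    with open(path, "rb") as f:
        head = f.read(512)
    with io.open(path, encoding=pick_encoding(head, user_encoding)) as f:
        for line in f:
            yield line
```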

Cheers
Lex

>
> Cheers, Stuart
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "asciidoc" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/asciidoc?hl=en.
>

