Apologies for the first copy going to the wrong discussion (I used reply-to in Thunderbird and cut and pasted; somehow the original source was remembered and used by Google to attach it to the original thread).
When AsciiDoc was first released, almost a decade ago, it could only process ASCII characters. Compatibility with other character sets has been achieved over time, as problems arose, with a succession of ad hoc kludges. Instead of reading and decoding all text to Unicode, AsciiDoc reads the text as binary. Where necessary it attempts to decode and encode internally, but mostly it just passes the bytes through to the output. Not surprisingly it's all a bit of a mess. In hindsight I'm astonished that it has worked as well as it has.

Moving forward, there will be no Python 3 port until character handling has been rewritten to process all text inputs internally as Unicode strings (I was able to port asciidoc to Python 3 fairly trivially; things only really unraveled with non-ASCII character handling).

It's also high time the text encoding rules were formalized. I haven't got beyond the planning stage, but here are the proposed conventions going forward:

. UTF-8 is the default encoding (no change here).
. All configuration (.conf) files to be UTF-8 encoded (afaik all current .conf files are UTF-8).
. The AsciiDoc 'encoding' attribute sets the encoding of source files and output files (no change here).
. The setting of the 'encoding' attribute in AsciiDoc source documents is prohibited (you have to set it on the command-line or from configuration files).

The last rule is there to avoid a Catch-22: the document would have to be decoded before the attribute that says how to decode it could be read. In theory at least, it would introduce a backward incompatibility, because currently the User Guide states ``The 'encoding' attribute can be set using an AttributeEntry inside the document header''. But this is broken anyway in that it only applies to character sets that are backward compatible with ASCII e.g. ISO-8859-1 (latin-1).

Another potential source of problems could be when writing to stdout instead of a file (afaik stdout and stdin encodings are set in the OS execution environment and can't be changed by the executing process).
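To make the proposal concrete, here is a minimal Python 3 sketch of the "decode on input, Unicode inside, encode on output" idea. It is not taken from the asciidoc source: the function names (`process_document`, `header_visible_in_bytes`, `text_writer`) are hypothetical, and the newline normalization is only a stand-in for real processing. The second function illustrates the Catch-22 behind the last rule, and the third shows one way a Python 3 process might pick its own output encoding by wrapping a binary stream such as `sys.stdout.buffer`.

```python
import io

DEFAULT_ENCODING = "utf-8"  # the proposed default


def process_document(raw: bytes, encoding: str = DEFAULT_ENCODING) -> bytes:
    """Decode once on input, work on str internally, encode once on output."""
    text = raw.decode(encoding)        # bytes -> Unicode at the input boundary
    text = text.replace("\r\n", "\n")  # stand-in for the real processing
    return text.encode(encoding)       # Unicode -> bytes at the output boundary


def header_visible_in_bytes(raw: bytes) -> bool:
    """Report whether a scan of undecoded bytes finds an 'encoding' entry.

    This only succeeds for ASCII-compatible encodings, which is the
    Catch-22: the attribute cannot be read until the text is decoded.
    """
    return b":encoding:" in raw


def text_writer(binary_stream, encoding: str = DEFAULT_ENCODING):
    """Wrap a binary stream (e.g. sys.stdout.buffer) with a chosen encoding."""
    return io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
```

With a document such as `":encoding: ISO-8859-1\ncafé\n"`, `header_visible_in_bytes` finds the entry when the bytes are ISO-8859-1 encoded, but not when the same text is encoded as UTF-16, where every character occupies at least two bytes.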
I'm no expert on multi-lingual character sets or Unicode so any comments, thoughts or suggestions would be welcome.

Cheers, Stuart
