On 27/10/2007, at 14:55, Michel Fortin wrote:

[...]
Now, the interesting question is: what should PHP Markdown (or any Markdown implementation for that matter) do with the UTF-8 BOM? Here are three options:

1. Remove it?
2. Keep it at the start of the text?
3. Ignore it (as it does now)?

Option 3 seems a logical option to me

Yes, ignore it!

[...]
Between option 1 and 2, surely option 1 (dropping the BOM) is the best. Otherwise it'd be hard to concatenate the output with a template HTML document.

And that is why the user should not have placed the BOM in a UTF-8 file in the first place ;)
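
For what it's worth, dropping a leading BOM (option 1 above) is only a few lines. Here is a minimal sketch in Python rather than PHP, purely for illustration; the function name `strip_utf8_bom` is made up for this example and is not part of PHP Markdown or any other implementation:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    # No BOM at the start: hand the input back unchanged.
    return data
```

Only a BOM at offset 0 is touched; anything later in the input is left alone.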

UTF-8 is an ASCII superset, which lets 99% of existing programs that deal with ASCII work flawlessly with the text. Add the BOM and you break that. For example: use ‘cat’ to concatenate files and you get BOMs in the middle of the result; use ‘grep’ to extract stuff and you may or may not get a BOM in the result; put a BOM in front of a shebang line and you’ll find that execv() won’t recognize it; save your C source with a BOM and gcc will choke on it; etc.
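
The ‘cat’ case is easy to reproduce. A rough sketch in Python, assuming two inputs that each start with a BOM; `concat_like_cat` and `count_stray_boms` are hypothetical helpers written for this example, with the first simply joining bytes the way ‘cat’ would:

```python
import codecs


def concat_like_cat(*chunks: bytes) -> bytes:
    """Join byte strings end to end, with no knowledge of encodings."""
    return b"".join(chunks)


def count_stray_boms(data: bytes) -> int:
    """Count UTF-8 BOMs that start anywhere after byte offset 0."""
    return data.count(codecs.BOM_UTF8, 1)


header = codecs.BOM_UTF8 + b"<html>\n"
body = codecs.BOM_UTF8 + b"<p>Hello</p>\n"
joined = concat_like_cat(header, body)
stray = count_stray_boms(joined)  # the BOM from `body` is now mid-stream
```

The second BOM ends up in the middle of the result, where it no longer marks the start of anything.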

The BOM is a byte order mark for UTF-16; it has no place in UTF-8. Some may argue it is there to indicate that the file is UTF-8, but UTF-8 can already be recognized with >99% certainty without the BOM, so the BOM doesn’t really help here. And when text is sent over the wire, there generally is a specified default encoding and a way to change that, which does not include adding garbage to the start of the file (and to the best of my knowledge no standard calls for the examination of the first 3 bytes to determine encoding).
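
One common way to recognize UTF-8 without a BOM is simply to try a strict decode: text in legacy 8-bit encodings rarely happens to form valid UTF-8 multi-byte sequences. A minimal sketch in Python; `looks_like_utf8` is a name invented for this example, and the check is a heuristic that does not by itself establish the >99% figure:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # the default error mode is strict
    except UnicodeDecodeError:
        return False
    return True


text = "na\u00efve caf\u00e9"
utf8_ok = looks_like_utf8(text.encode("utf-8"))      # valid UTF-8
latin1_ok = looks_like_utf8(text.encode("latin-1"))  # 0xEF followed by 'v' is invalid
```

Note that pure ASCII input also passes, which is consistent with UTF-8 being an ASCII superset.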

[...]
UTF-8 BOM handling sounds like a good thing to add to MDTest too.

I’d say no -- on the contrary, if the user adds a BOM to his UTF-8 file he should be told that this is a bad idea. Fortunately none of the text editors on my system even has this option ;)

_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss
