James G. sack (jim) added the comment:

Feature Request REVISION
========================
Upon reflection and more playing around with some test cases, I wish to 
revise my feature request.

I think the utf8 codecs should accept input with or without the "sig".
On output, only the utf_8_sig should write the 3-byte "sig". This behavior 
change would not seem disruptive to current applications. 

For utf_16, (arguably) a missing BOM should merely imply machine endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
On output, I'm not sure; maybe all should write a BOM unless passed a flag 
signifying no bom? 
Or, to preserve backward compatibility, there could be a parameter write_bom 
defaulting to True for utf_16 and False for utf_16_le and utf_16_be. This is 
a modification of the original request (for a force_bom flag).
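
To make the proposed defaults concrete, here is a minimal sketch in Python 3. 
The function encode_utf16 and its write_bom argument are hypothetical and 
exist only for illustration; they are not part of the codecs module.

```python
import codecs
import sys


def encode_utf16(text, codec="utf_16", write_bom=None):
    """Hypothetical helper illustrating the proposed write_bom defaults."""
    if codec not in ("utf_16", "utf_16_le", "utf_16_be"):
        raise ValueError("unsupported codec: %r" % codec)

    # Proposed defaults: BOM for utf_16, no BOM for the explicit variants.
    if write_bom is None:
        write_bom = (codec == "utf_16")

    # utf_16 uses the machine's byte order; the others are explicit.
    if codec == "utf_16":
        little = (sys.byteorder == "little")
    else:
        little = (codec == "utf_16_le")

    bom = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
    body = text.encode("utf-16-le" if little else "utf-16-be")
    return (bom if write_bom else b"") + body
```

With these defaults, encode_utf16("a", "utf_16_le") yields the two payload 
bytes only, while passing write_bom=True prepends the matching BOM.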

Unless I have confused myself with my test cases, the current codecs are 
slightly inconsistent for the utf8 codecs:

utf8 treats "sig" as real data, if present, but..
utf_8_sig works right even without the "sig" (so this one I like as is!)
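
The utf8 behavior described above can be reproduced with a short Python 3 
test case along these lines. The variable names are mine; only the standard 
codecs module and the built-in bytes.decode are used.

```python
import codecs

plain = "abc".encode("utf-8")
signed = codecs.BOM_UTF8 + plain  # the 3-byte "sig" followed by the payload

# utf8 keeps the "sig" as a leading U+FEFF character in the result.
utf8_on_signed = signed.decode("utf-8")

# utf_8_sig strips the "sig" when present and accepts input without it.
sig_on_signed = signed.decode("utf-8-sig")
sig_on_plain = plain.decode("utf-8-sig")

# On output, only utf_8_sig writes the "sig".
utf8_out = "abc".encode("utf-8")
sig_out = "abc".encode("utf-8-sig")
```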

The 16'ers seem to match the (inferred) specs, but for completeness here:
utf_16 refuses to proceed w/o BOM (even with correct endian input data)
utf_16_le treats BOM as data
utf_16_be treats BOM as data
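
A comparable Python 3 sketch for the 16-bit codecs follows. It assumes the 
"refuses to proceed" case refers to stream-style (incremental) decoding; as 
far as I know, a one-shot bytes.decode("utf-16") without a BOM falls back to 
native byte order instead of raising. Variable names are illustrative.

```python
import codecs

payload_le = "hi".encode("utf-16-le")
payload_be = "hi".encode("utf-16-be")

# utf_16_le and utf_16_be keep a leading BOM as a U+FEFF character.
le_with_bom = (codecs.BOM_UTF16_LE + payload_le).decode("utf-16-le")
be_with_bom = (codecs.BOM_UTF16_BE + payload_be).decode("utf-16-be")

# utf_16 consumes the BOM and uses it to select the byte order.
generic_with_bom = (codecs.BOM_UTF16_BE + payload_be).decode("utf-16")

# The incremental utf_16 decoder (used for stream reading) rejects
# input that does not start with a BOM.
decoder = codecs.getincrementaldecoder("utf-16")()
try:
    decoder.decode(payload_le, final=True)
    stream_error = None
except UnicodeError as exc:
    stream_error = exc
```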

Regards,
..jim

__________________________________
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1328>
__________________________________