"Stephen J. Turnbull" <[email protected]>:
> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points
HTML and XML are interesting examples since their encoding is initially
unknown:

    <?xml version="1.0"?>
                        ^
                        +--- Now I know it is UTF-8

    <?xml version="1.0" encoding="UTF-16"?>
                                          ^
                                          +--- Now I know it was UTF-16
                                               all along!

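
For the XML case, a rough sketch of the sniffing step in Python might look
like this. The name sniff_xml_encoding is made up for illustration, UTF-32
and EBCDIC are ignored, and the caller is assumed to pass in the first
hundred or so bytes of the stream:

```python
import codecs
import re

# Hypothetical helper for illustration; not a standard-library API.
# Matches an encoding label inside an ASCII-readable XML declaration.
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    # A byte order mark settles the question before any parsing.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: "<?" encoded as UTF-16 shows a telltale zero-byte pattern.
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # Otherwise the declaration is readable as ASCII; take its label.
    match = _DECL_RE.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no encoding attribute: XML defaults to UTF-8.
    return "utf-8"
```

Only once this returns can the remaining bytes be handed to a decoder, so
the first chunk has to be read as bytes no matter what.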
Then we have:

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=ISO-8859-1

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you realize the
character encoding is UTF-16.
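
To make the two competing labels concrete, here is a small Python sketch
that pulls the charset out of the HTTP header and out of the <meta> tag
separately. The function names are invented for illustration, the <meta>
scan is a crude regular expression rather than a real HTML parser, and
which label should win is left to the caller:

```python
import re
from email.message import Message
from typing import Optional

# Crude pattern for a charset label inside a <meta> tag (assumption:
# the tag is readable as ASCII and fits in the bytes we were given).
_META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def header_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(body_head: bytes) -> Optional[str]:
    """Return the charset named in the first <meta> tag that has one."""
    match = _META_RE.search(body_head)
    return match.group(1).decode("ascii").lower() if match else None


body = (
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
    b"<html>\n<head>\n"
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-16">\n'
)
print(header_charset("text/html; charset=ISO-8859-1"), meta_charset(body))
```

The two answers disagree, and the second one is only available after
reading well into the body as raw bytes.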
Marko