Re: [Python-Dev] Bytes path support

Marko Rauhamaa Sat, 23 Aug 2014 02:04:53 -0700

"Stephen J. Turnbull" <[email protected]>:

> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points


HTML and XML are interesting examples since their encoding is initially
unknown:

  <?xml version="1.0"?>
                      ^
                      +--- Now I know it is UTF-8

  <?xml version="1.0" encoding="UTF-16"?>
                                      ^
                                      +--- Now I know it was UTF-16
                                           all along!
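
A minimal sketch of that kind of sniffing, assuming the caller already has the
first few dozen bytes of the document in hand. The name sniff_xml_encoding is
hypothetical, and UTF-32 and EBCDIC are deliberately left out:

```python
import re

# Matches an encoding pseudo-attribute inside an ASCII-compatible XML declaration.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(head: bytes) -> str:
    """Guess an XML document's encoding from its leading bytes (hypothetical helper)."""
    # A byte order mark settles the question before any parsing happens.
    if head.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if head.startswith((b'\xff\xfe', b'\xfe\xff')):
        return 'utf-16'  # Python's utf-16 codec picks the byte order from the BOM
    # UTF-16 without a BOM: "<?" appears with interleaved NUL bytes.
    if head.startswith(b'<\x00?\x00'):
        return 'utf-16-le'
    if head.startswith(b'\x00<\x00?'):
        return 'utf-16-be'
    # Otherwise the declaration is readable as ASCII; look for encoding="...".
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode('ascii').lower()
    # No BOM and no declared encoding: XML defaults to UTF-8.
    return 'utf-8'
```

Note that the answer is only available after some bytes have been read and
inspected as bytes, which is the point of the two examples above.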

Then we have:


  HTTP/1.1 200 OK
  Content-Type: text/html; charset=ISO-8859-1

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep into the TCP stream you have to parse before you learn that
the document declares its character encoding as UTF-16.
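
As a rough illustration, here is a sketch that pulls both declarations out of a
raw response held as bytes. The function name declared_encodings and the
1024-byte scan window are assumptions of mine, and the body is scanned as
ASCII-compatible bytes, so this is not a real HTML parser:

```python
import re

# charset=... as it appears in a Content-Type header or a <meta> tag.
_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)
_META = re.compile(rb'<meta[^>]*>', re.IGNORECASE)

def declared_encodings(raw: bytes, scan_limit: int = 1024):
    """Return (http_charset, meta_charset) from a raw HTTP response.

    Hypothetical helper; either element is None when nothing is declared.
    """
    head, _, body = raw.partition(b'\r\n\r\n')

    http_charset = None
    # Skip the status line, then look for a Content-Type header.
    for line in head.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-type':
            match = _CHARSET.search(value)
            if match:
                http_charset = match.group(1).decode('ascii').lower()

    meta_charset = None
    # Only the first scan_limit bytes of the body are searched for <meta> tags.
    for tag in _META.finditer(body[:scan_limit]):
        match = _CHARSET.search(tag.group(0))
        if match:
            meta_charset = match.group(1).decode('ascii').lower()
            break

    return http_charset, meta_charset
```

Fed the response above, this returns ('iso-8859-1', 'utf-16'). Which of the two
declarations to believe is a policy decision the sketch leaves to the caller;
either way, nothing can be decoded until both layers have been read as bytes.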


Marko
