On Oct 18, 12:50 am, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
> StarWing wrote:
> > On Oct 17, 9:54 pm, Arian Kuschki <arian.kusc...@googlemail.com>
> > wrote:
> >> Hi all
> >>
> >> this has been bugging me for a long time and I do not seem to be
> >> able to understand what to do. I always have problems when dealing
> >> input text that contains umlauts. Consider the following:
> >>
> >> In [1]: import urllib
> >>
> >> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> >>
> >> In [3]: xml = f.read()
> >>
> >> In [4]: f.close()
> >>
> >> In [5]: print xml
> >> ------> print(xml)
> >> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
> >> tab_id="0" mobile_row="0" mobile_zipped="1" row="0"
> >> section="0"><forecast_information><cit
> >>
> >> y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
> >> data=""/><longitude_e6 data=""/><forecast_date
> >> data="2009-10-17"/><current_date_time data="2009-10
> >> -17 14:20:00 +0000"/><unit_system
> >> data="SI"/></forecast_information><current_conditions><condition
> >> data="Meistens
> >> bew kt"/><temp_f data="43"/><temp_c data="6"/><h
> >> umidity data="Feuchtigkeit: 87 %"/><icon
> >> data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W
> >> mit
> >> Windgeschwindigkeiten von 13 km/h"/></curr
> >> ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
> >> data="1"/><high data="7"/><icon
> >> data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
> >> ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
> >> data="So."/><low data="-1"/><high data="8"/><icon
> >> data="/ig/images/weather/chance_of_sno
> >> w.gif"/><condition data="Vereinzelt
> >> Schnee"/></forecast_conditions><forecast_conditions><day_of_week
> >> data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
> >> mages/weather/mostly_sunny.gif"/><condition data="Teils
> >> sonnig"/></forecast_conditions><forecast_conditions><day_of_week
> >> data="Di."/><low data="0"/><high data="8"
> >> /><icon data="/ig/images/weather/sunny.gif"/><condition
> >> data="Klar"/></forecast_conditions></weather></xml_api_reply>
> >>
> >> As you can see the umlauts in the XML are not displayed properly.
> >> When I want to process this text (for example with xml.sax), I get
> >> error messages because the parses can't read this.
> >>
> >> I've tried to read up on this and there is a lot of information on
> >> the web, but nothing seems to work for me. For example setting the
> >> coding to UTF like this:
> >> # -*- coding: utf-8 -*- or using the decode() string method.
> >>
> >> I always have this kind of problem when input contains umlauts, not
> >> just in this case. My locale (on Ubuntu) is en_GB.UTF-8.
> >>
> >> Cheers
> >> Arian
> >
> > try this?
> >
> > # vim: set fencoding=utf-8:
> > import urllib
> > import xml.sax as sax, xml.sax.handler as handler
> >
> > f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> > xml = f.read()
> > xml = xml.decode("cp1252")
> > f.close()
> >
> > class my_handler(handler.ContentHandler):
> >     def startElement(self, name, attrs):
> >         print "begin:", name, attrs
> >
> >     def endElement(self, name):
> >         print "end:", name
> >
> > sax.parseString(xml, my_handler())
>
> This is wrong. XML is a *byte*-based format, which explicitly states
> encodings. So decoding a byte-string to a unicode-object and then
> passing it to a parser is not working in the very moment you have
> data that
>
>   - is outside your default-system-encoding (ususally ascii)
>   - the system-encoding and the declared decoding differ
>
> Besides, I don't see where the whole SAX-stuff is supposed to do
> anything the direct print and the decode() don't do - smells like
> cargo-cult to me.
>
> Diez

Yes, XML is a *byte*-based format, and so are UTF-8 and the code pages (cp936, cp1252, etc.). That is why XML usually declares its encoding in its header, but that didn't work here.

In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use sys.setdefaultencoding(), and f.read() returns a str. So what we get must be undecoded, byte-based data (i.e. raw XML), and decoding it with the right code page is safe (note that the web page is google.de).

In Python 3.1, read() returns a bytes object, so we *must* decode it; otherwise we can't pass it to a parser.
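
To show the decoding step on its own, here is a minimal Python 3 sketch. It does not fetch the real page: the sample string, the helper name decode_response, and the choice of cp1252 are assumptions made only for illustration.

```python
# Hypothetical sample standing in for the bytes that f.read() returns;
# the real response from the server is not reproduced here.
SAMPLE_TEXT = '<condition data="Meistens bew\u00f6lkt"/>'

# Encode it the way a cp1252 server would send it: as raw bytes.
raw = SAMPLE_TEXT.encode("cp1252")


def decode_response(data, encoding="cp1252"):
    """Decode raw response bytes with an explicitly chosen code page.

    The default of cp1252 is an assumption, not something the
    response itself declares.
    """
    return data.decode(encoding)


text = decode_response(raw)

# The same bytes are not valid UTF-8: the umlaut is the single byte
# 0xF6, which can never start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(text))
print("valid UTF-8:", utf8_ok)
```

The point is only that the bytes have to be decoded with the code page they were written in; guessing UTF-8 for cp1252 data raises an error instead of producing the umlaut.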