Magnus Pettersson wrote:
> # This made the fetching of the website work. Why did i have to write
> # url.encode("UTF-8") when url already is unicode? I feel i dont have a
> # good understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))

Start here:

"The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)"

http://www.joelonsoftware.com/articles/Unicode.html

Basically, Unicode is an in-memory data format. Python knows about
Unicode characters (to be technical: code points), but files on disk do
not. Neither do network protocols, or terminals, or other simple
devices. They only understand bytes.

So when you have Unicode text, and you want to write it to a file on
disk, or print it, or send it over the network to another machine, it
has to be *encoded* into bytes, and then *decoded* back into Unicode
when you read it back in. That is what your url.encode("UTF-8") call
does: it turns the Unicode string into UTF-8 bytes before it goes out
over the network.

Sometimes the system will "helpfully" do that encoding and decoding
automatically for you, which is fine when it works, but when it doesn't
it can be perplexing.

There are many, many, many different *encoding schemes*. ASCII is one.
UTF-8 is another. And then there are about a bazillion legacy encodings
which, if you are lucky, you will never need to care about.

Only some encodings can deal with the entire range of Unicode
characters; most can only deal with a (typically small) subset. E.g.
ASCII only knows about 128 characters out of the million-plus code
points that Unicode has room for. Latin-1 tops out at 256.

If you have a say in the matter, always use UTF-8, since it can handle
the full set of Unicode characters, is backwards compatible with ASCII,
and is compact for mostly-ASCII text.

--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
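
A minimal sketch of the encode/decode round trip described above, and of
how ASCII and Latin-1 fail on characters outside their range. The sample
string and the helper can_encode() are made up for illustration; they
are not from the original thread. The u"" literals keep the snippet
valid on both Python 2 and Python 3.3+.

```python
# Illustrative only: sample text and helper name are assumptions.
text = u"Bj\u00f6rk \u2603"   # Unicode text: "Bjork" with o-umlaut, a space, a snowman

# Encode: Unicode code points -> bytes (what goes to disk or the network).
utf8_bytes = text.encode("utf-8")

# Decode: bytes -> Unicode code points (what you do when reading back).
round_tripped = utf8_bytes.decode("utf-8")


def can_encode(s, encoding):
    """Return True if every character of s fits in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 covers all of Unicode; Latin-1 stops at U+00FF; ASCII at U+007F.
for name in ("utf-8", "latin-1", "ascii"):
    print("%s: %s" % (name, can_encode(text, name)))
```

Note that the number of bytes differs from the number of characters:
UTF-8 spends one byte on each ASCII character but two or three on the
others in this string, which is why len(text) and len(utf8_bytes)
disagree.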