dieter wrote:

> Veek M <vek.m1...@gmail.com> writes:
>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>> position 8: illegal multibyte sequence
>
> You give us very little context.

It's a longish chunk of code: basically, I'm trying to download using
'requests.Session', and that should give me Unicode once it's told
which encoding is being used ('gbk').

    def get_page(s, url):
        print(url)
        r = s.get(url, headers={
            'User-Agent': user_agent,
            'Keep-Alive': '3600',
            'Connection': 'keep-alive',
        })
        # The encoding is set on the response; r.text is decoded with it.
        r.encoding = 'gbk'
        text = r.text
        return text

    # Open output file
    fh = codecs.open('/tmp/out', 'wb')
    fh.write(header)

    # Download
    s = requests.Session()

------------

If 'text' is NOT proper Unicode because the server introduced some junk,
then when I do anchor.getparent() on my 'text' it'll traceback. Ergo the
question: how do I set a replacement char within 'requests'?

> In general: when you need control over encoding handling because
> deep in a framework an encoding causes problems (as apparently in
> your case), you can usually first take the plain text,
> fix any encoding problems and only then pass the fixed text to
> your framework.
>
>> I'm doing:
>> s = requests.Session()
>> to suck data in, so.. how do i 'replace' chars that fit gbk
>
> It does not seem that the problem occurs inside the "requests" module.
> Thus, you have a chance to "intercept" the downloaded text
> and fix encoding problems.

Okay, so I should use the 'raw' method in requests, clean up the raw
text, and then convert that to Unicode, versus trying to do it using
'requests'?

The thing is, 'codecs' has xmlcharrefreplace_errors(...) etc., so I
figured if output has clean-up, input ought to have it too :p
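
A minimal sketch of the "intercept and fix" idea discussed above, written in Python 3 syntax (the thread itself is Python 2). It assumes the raw bytes are at hand, for example from `r.content` in requests. `decode_body` and `encode_for_output` are hypothetical helper names used only for illustration; they are not part of requests or codecs. Note that the built-in `xmlcharrefreplace` handler only works when encoding, so the decoding side uses `replace` instead.

```python
def decode_body(raw, encoding="gbk"):
    """Decode downloaded bytes, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


def encode_for_output(text, encoding="gbk"):
    """Encode text, turning unencodable characters into &#NNN; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# b"\xc4\xe3\xba\xc3" is a valid two-character GBK sequence;
# the stray 0xff byte in the middle stands in for server junk.
page = decode_body(b"\xc4\xe3\xff\xba\xc3")

# U+00A0 (no-break space) has no GBK encoding, which is the character
# named in the traceback at the top of the thread.
out = encode_for_output(u"a\xa0b")
```

With requests, the call would presumably look like `decode_body(r.content)` in place of reading `r.text`, so the error handling stays under the caller's control.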