Re: Some clauses cause BeautifulSoup to choke?

Marc Christiansen Tue, 20 Nov 2007 12:16:30 -0800

Frank Stutzman <[EMAIL PROTECTED]> wrote:
> 
> Some kind person replied:
>> You have the same URL as both your good and bad example.
> 
> Oops, dang emacs cut buffer (yeah, that's what did it).  A working
> example URL would be (again, mind the wrap):
> 
> http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=ksfo&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search
>  
> 
> 
> Marc Christiansen <[EMAIL PROTECTED]> wrote:
> 
>> The problem is this line:
>> <META http-equiv="Content-Type" content="text/html; charset=UTF-16">
>> 
>> Which is wrong. The content is not utf-16 encoded. The line after that
>> declares the charset as utf-8, which is correct, although ascii would be
>> ok too.
> 
> Ah, er, hmmm.  Take a look at the 'good' URL I mentioned above.  You will
> notice that it has the same utf-16, utf-8 encoding that the 'bad' one
> has.  And BeautifulSoup works great on it.
> 
> I'm still scratchin' ma head...


 >>> s = bad.decode("utf-16")
 >>> s = good.decode("utf-16")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib/python2.5/encodings/utf_16.py", line 16, in decode
     return codecs.utf_16_decode(input, errors, True)
 UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 41176: truncated data

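bad contains the content of the 'bad' URL, good the content of the
'good' URL. Decoding bad as utf-16 raises no error (it just yields
garbage), so BeautifulSoup accepts the declared utf-16 charset. Decoding
good raises the UnicodeDecodeError, so BeautifulSoup tries either the
next encoding or the next step from the URL below.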
>> <http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful 
>> Soup Gives You Unicode, Dammit>)
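
To illustrate why the two pages behave differently, here is a minimal
sketch in Python 3 (the session above is Python 2.5). It is not
BeautifulSoup's actual code: first_decodable, even_page and odd_page are
hypothetical names and data made up for this example. Plain ASCII bytes
of even length decode as utf-16 without error (into nonsense
characters), while an odd length raises the same "truncated data" error
as above, which appears to be what separates the 'bad' page from the
'good' one.

```python
def first_decodable(data, encodings):
    """Return (encoding, text) for the first encoding that decodes cleanly.

    A simplified stand-in for trying candidate encodings in order;
    returns (None, None) if every candidate fails.
    """
    for enc in encodings:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None


# Hypothetical stand-ins for the two pages: plain ASCII markup that
# differs only in whether the byte count is even or odd.
even_page = b"<html><body>ok</body></html>"  # 28 bytes
odd_page = even_page + b"\n"                 # 29 bytes

# The wrongly declared charset is tried first, then the correct one.
candidates = ["utf-16", "utf-8"]

enc_even, text_even = first_decodable(even_page, candidates)
enc_odd, text_odd = first_decodable(odd_page, candidates)

print(enc_even, repr(text_even))  # utf-16 "succeeds", text is garbage
print(enc_odd, repr(text_odd))    # utf-16 fails, utf-8 gives the markup
```

So a page whose length happens to be even gets silently mis-decoded,
while an odd-length page fails early and falls through to an encoding
that works.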

> Much appreciate all the comments so far.

You're welcome.

Marc
-- 
http://mail.python.org/mailman/listinfo/python-list
