I'm trying to recover the original data from some HTML written by a well-known application.
Here are three original data items, in Python repr() format, with spaces changed to tildes for clarity:

    u'Saturday,~19~January~2008'
    u'Line1\nLine2\nLine3'
    u'foonly~frabjous\xa0farnarklingliness'

Here is the HTML, with spaces changed to tildes, angle brackets changed to square brackets, omitting \r\n from the end of each line, and stripping a large number of attributes from the [td] tags:

    ~~[td]Saturday,~19
    ~~January~2008[/td]
    ~~[td]Line1[br]
    ~~~~Line2[br]
    ~~~~Line3[/td]
    ~~[td]foonly
    ~~frabjous&nbsp;farnarklingliness[/td]

Here are the results of feeding it to ElementSoup:

    >>> import ElementSoup as ES
    >>> elem = ES.parse('ws_soup1.htm')
    >>> from pprint import pprint as pp
    >>> pp([(e.tag, e.text, e.tail) for e in elem.getiterator()])
    [snip]
     (u'td', u'Saturday, 19\n January 2008', u'\n'),
     (u'td', u'Line1', u'\n'),
     (u'br', None, u'\n Line2'),
     (u'br', None, u'\n Line3'),
     (u'td', u'foonly\n frabjous\xa0farnarklingliness', u'\n')]

I'm happy enough with reassembling the second item. The problem is in reliably and correctly collapsing the whitespace in each of the above five elements. The standard Python idiom of u' '.join(text.split()) won't work: because the text is Unicode, split() treats u'\xa0' as whitespace, so the no-break space would be converted to an ordinary space.

Should whitespace collapsing be done earlier? Note that BeautifulSoup leaves it as &nbsp; -- ES does the conversion to \xa0 ...

Does anyone know of an html_collapse_whitespace() for Python? Am I missing something obvious?

Thanks in advance,
John
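
P.S. To make the question concrete, here is a rough sketch of the behaviour I have in mind. It assumes that "HTML whitespace" means only space, tab, LF, FF and CR, so that u'\xa0' is left alone, and that trimming both ends of each .text/.tail value is acceptable for this data. The name html_collapse_whitespace() is just the hypothetical one from my question, not an existing library function.

```python
import re

# Assumption: only these five characters count as collapsible HTML
# whitespace. U+00A0 (no-break space) is deliberately not in the class.
_HTML_WS = re.compile(u'[ \t\n\x0c\r]+')


def html_collapse_whitespace(text):
    """Collapse each run of HTML whitespace to one space; trim both ends."""
    if text is None:
        # ElementTree-style .text/.tail may be None; treat that as empty.
        return u''
    # After the substitution every run is a single space, so stripping
    # u' ' only removes leading/trailing HTML whitespace, never u'\xa0'.
    return _HTML_WS.sub(u' ', text).strip(u' ')


if __name__ == '__main__':
    samples = [
        u'Saturday, 19\n January 2008',
        u'\n Line2',
        u'foonly\n frabjous\xa0farnarklingliness',
    ]
    for s in samples:
        print(repr(html_collapse_whitespace(s)))
```

This keeps the \xa0 in the third item where u' '.join(text.split()) would lose it. Whether trimming the ends is safe in general (for example around inline tags) is something I have not checked.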