Hello Tibor et al,

A few weeks ago I found out that html2text is inadequate for creating plain text from HTML, because it only handles iso-8859-1. I suggest using lynx instead, as I explained in this mail (attached fragment).
As I haven't seen any news about this, maybe I should open a ticket for it, but I don't know where: I'm not up to date on your Trac migration. Should I create it myself? Your guidance would be appreciated.

Ferran
--- Begin Message ---
[...]

A second issue I'm having is that, on our site, we have a lot of HTML documents, and a bunch of them are in non-UTF-8 charsets (mostly iso-8859-1 and windows-1251). I have been watching and debugging this the whole morning. In a word, bibindex_engine expects everything in UTF-8, and when it is not, it complains loudly. Adding the exception text to the message, I got:

2010-03-15 09:47:05 --> Error: Cannot put word num??riques with sign 1 for recID 10 (exception: 'utf8' codec can't decode bytes in position 9-11: invalid data).

How can one get UTF-8-clean text from any HTML document, in any charset? html2text has the -ascii option to output unaccented text, but it didn't do any good on my files. Fortunately, lynx handles this cleanly. This quick-and-dirty patch allows me to make some progress:

@@ -417,6 +417,8 @@ def get_words_from_fulltext(url_direct_or_indirect, stemming_language=None):
     elif os.path.basename(conv_program) == "html2text":
         cmd = "%s %s > %s" % \
             (conv_program, tmp_name, tmp_dst_name)
+        cmd = "lynx -dump -display_charset=utf8 %s >%s" % \
+            (tmp_name, tmp_dst_name)
     else:
         write_message("Error: Do not know how to handle %s conversion program." % conv_program, sys.stderr)
     # try to run it:
[...]
--- End Message ---
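For reference, the decoding problem described in the attached message can also be worked around in pure Python, without shelling out to lynx. This is only a minimal sketch under stated assumptions (the function name is hypothetical, and a simple utf-8 -> iso-8859-1 fallback is used, not real charset detection):

```python
def to_utf8_text(raw: bytes) -> str:
    """Decode raw HTML bytes as UTF-8 if valid, else fall back to iso-8859-1.

    Note: iso-8859-1 maps every possible byte, so this function never raises,
    but documents that are really windows-1251 would be mis-decoded here;
    handling those properly needs actual charset sniffing (e.g. a library
    such as chardet) rather than this simple two-step fallback.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

Whatever the resulting text is, it can then be re-encoded with .encode("utf-8") before indexing, which would avoid the 'utf8' codec errors quoted above.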
