Hello Tibor et al,

a few weeks ago I found out that html2text is inadequate for creating
plain text from HTML, because it only knows about iso-8859-1.  I suggest
using lynx instead, as I explained in the mail fragment attached below.

As I haven't seen any news about this, maybe I should create a ticket
for it.  But I don't know where, because I'm not up to date on your
Trac migration.

Should I do it myself?  Your guidance will be appreciated.

Ferran

--- Begin Message ---
[...]
A second issue I'm having is that, on our site, we have a lot of HTML
documents, and a bunch of them are in non-UTF-8 charsets (mostly
iso-8859-1 and windows-1251).  I spent the whole morning watching and
debugging this.  In short, bibindex_engine expects everything in UTF-8,
and when it is not, it complains loudly.  After adding the exception
text to the error message, I got:


 2010-03-15 09:47:05 --> Error: Cannot put word num??riques with sign 1 for recID 10 (exception: 'utf8' codec can't decode bytes in position 9-11: invalid data).
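
For illustration only (this is a minimal sketch, not Invenio code): bytes
that are valid iso-8859-1 but not valid UTF-8 make a strict UTF-8 decode
blow up exactly like above, while decoding through the real source charset
yields clean UTF-8:

    # Minimal sketch, not Invenio code: reproduce the decode failure and
    # show that going through the real source charset fixes it.
    raw = u"num\u00e9riques".encode("iso-8859-1")    # b'num\xe9riques'

    try:
        raw.decode("utf-8")                          # what bibindex_engine effectively does
    except UnicodeDecodeError as exc:
        print("strict utf-8 decode fails: %s" % exc)

    clean = raw.decode("iso-8859-1").encode("utf-8") # clean UTF-8 bytes
    print(clean.decode("utf-8"))                     # numériques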


How can I get clean UTF-8 text from any HTML document, whatever its
charset?  html2text has the -ascii option to output unaccented text, but
it didn't do any good on my files.  Fortunately, lynx handles this
cleanly.  This quick-and-dirty patch lets me make some progress (a
subprocess-based sketch of the same call follows the patch):


@@ -417,6 +417,8 @@ def get_words_from_fulltext(url_direct_or_indirect, stemming_language=None):
                 elif os.path.basename(conv_program) == "html2text":
                     cmd = "%s %s > %s" % \
                           (conv_program, tmp_name, tmp_dst_name)
+                    cmd = "lynx -dump -display_charset=utf8 %s >%s" % \
+                        (tmp_name, tmp_dst_name)
                 else:
                     write_message("Error: Do not know how to handle %s conversion program." % conv_program, sys.stderr)
                 # try to run it:
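
For reference, and only as a sketch (function and variable names below are
made up, this is not how Invenio wires it in): the same lynx invocation
could also be run without building a shell string, e.g.:

    import subprocess

    def html_to_utf8_text(src_path, dst_path):
        # Sketch only: same lynx invocation as in the patch above,
        # run through subprocess instead of a shell pipeline.
        with open(dst_path, "wb") as dst:
            subprocess.check_call(
                ["lynx", "-dump", "-display_charset=utf8", src_path],
                stdout=dst)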
[...]

--- End Message ---
