Hi there!

   I would like to use the libxml-ruby library to parse HTML fetched directly from the web.  The HTML is usually not well-formed, so libxml-ruby complains with a long list of errors.

   If I first convert the HTML to XML with the tidy program, I still get errors such as:

---------------------------------------------------- 8< -----------------------------------------------------------------

/tmp/temp.xml26359.0:1: parser error : Space required after the Public Identifier
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
                                                              ^
/tmp/temp.xml26359.0:1: parser error : SystemLiteral " or ' expected
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
                                                              ^
/tmp/temp.xml26359.0:1: parser error : SYSTEM or PUBLIC, the URI is missing
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
                                                              ^
/tmp/temp.xml26359.0:55: parser error : Entity 'nbsp' not defined
              </select>&nbsp;
                             ^
/tmp/temp.xml26359.0:57: parser error : Entity 'nbsp' not defined
              <input type="submit" value="&gt;&gt;" class="smallForm" />&nbsp;&n
                                                                              ^
/tmp/temp.xml26359.0:57: parser error : Entity 'nbsp' not defined
          <input type="submit" value="&gt;&gt;" class="smallForm" />&nbsp;&nbsp;
                                                                               ^
---------------------------------------------------- 8< -----------------------------------------------------------------

   How can I resolve these errors without having to modify the XML?
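
   For reference, the errors above fall into two groups: the DOCTYPE has a public identifier but no system identifier, and &nbsp; is an HTML entity that XML does not predefine. One possible workaround, shown here only as a minimal sketch, is to clean an in-memory copy of the markup before parsing so that the file on disk stays untouched. The names xml_safe_copy and HTML_ENTITIES are hypothetical, and the entity table is a small illustrative subset rather than the full HTML list.

```ruby
# Illustrative subset of HTML named entities that XML does not predefine.
HTML_ENTITIES = {
  'nbsp'  => 160,
  'copy'  => 169,
  'reg'   => 174,
  'laquo' => 171,
  'raquo' => 187
}.freeze

# Returns a cleaned copy of +markup+; the original string is not modified.
def xml_safe_copy(markup)
  # Drop the HTML DOCTYPE declaration, which an XML parser rejects when
  # the system identifier is missing.
  cleaned = markup.sub(/<!DOCTYPE\s+html\b[^>]*>\s*/i, '')

  # Rewrite known named entities as numeric character references.
  # Unknown names (and XML's own &amp;, &lt;, &gt;, ...) are left as they are.
  cleaned.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do
    code = HTML_ENTITIES[Regexp.last_match(1)]
    code ? format('&#%d;', code) : Regexp.last_match(0)
  end
end

sample = %(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<p>a&nbsp;b &gt;&gt;</p>)
puts xml_safe_copy(sample)
```

   The cleaned string could then be handed to the XML parser in place of the temporary file. This only sidesteps the two error types listed above; it does not repair other well-formedness problems.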

   Also, is there a way to parse malformed HTML with libxml directly?  It seems that gnome-libxml2 supports that.
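
   To make the question concrete, here is a rough sketch of what such a call might look like. It assumes a libxml-ruby release that exposes LibXML::XML::HTMLParser with a string constructor taking an options: flag set; the release in use here may differ. The helper name parse_messy_html is hypothetical.

```ruby
# Parses tag-soup HTML with libxml2's HTML parser instead of the XML parser.
# Assumes the libxml-ruby gem is installed; it is loaded lazily so that this
# file can still be loaded when the gem is absent.
def parse_messy_html(html)
  require 'libxml'

  # RECOVER keeps parsing after errors; NOERROR and NOWARNING suppress
  # the error and warning reports.
  flags = LibXML::XML::HTMLParser::Options::RECOVER |
          LibXML::XML::HTMLParser::Options::NOERROR |
          LibXML::XML::HTMLParser::Options::NOWARNING

  LibXML::XML::HTMLParser.string(html, options: flags).parse
end
```

   With something like this, parse_messy_html(page_source) would return a document that can be queried with find, and HTML entities such as &nbsp; would be resolved by the HTML parser itself, so no DOCTYPE or entity fix-ups would be needed.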

   Thanks!

   /AITOR
_______________________________________________
libxml-devel mailing list
libxml-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/libxml-devel
