Hi there,
I would like to use the libxml-ruby library to parse HTML fetched directly from the net. The HTML is usually not well-formed, so libxml-ruby complains with a long list of errors.
If I first convert the HTML to XML using the tidy program, I still get some errors like:
---------------------------------------------------- 8< -----------------------------------------------------------------
/tmp/temp.xml26359.0:1: parser error : Space required after the Public Identifier
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
^
/tmp/temp.xml26359.0:1: parser error : SystemLiteral " or ' expected
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
^
/tmp/temp.xml26359.0:1: parser error : SYSTEM or PUBLIC, the URI is missing
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
^
/tmp/temp.xml26359.0:55: parser error : Entity 'nbsp' not defined
</select>
^
/tmp/temp.xml26359.0:57: parser error : Entity 'nbsp' not defined
<input type="submit" value=">>" class="smallForm" /> &n
^
/tmp/temp.xml26359.0:57: parser error : Entity 'nbsp' not defined
<input type="submit" value=">>" class="smallForm" />
^
---------------------------------------------------- 8< -----------------------------------------------------------------
How can I get rid of those errors without having to modify the XML?
Also, is there a way to parse malformed HTML with libxml directly? It seems that gnome-libxml2 supports that.
Thanks!
/AITOR
_______________________________________________
libxml-devel mailing list
libxml-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/libxml-devel