Walter Underwood wrote:

Extracting links using a regular HTML parser works fine, and isn't
that much work. One of the major issues in an HTML parser is
dealing with all the illegal HTML on the web.



It really depends on what you are looking for and how tolerant of errors you are. For most of what I do, I use an HTML parser, but I have also used simple expression matching to pull out links. That approach tends to overestimate the links (e.g., it picks up references inside comments) and often yields fragments that are not really followable, but it is at least an option.
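
To make the difference concrete, here is a rough sketch in Python that runs both approaches over the same snippet. It is only an illustration, not the code I actually use: the sample markup, the function names, and the deliberately naive pattern are all made up for this message, and the parser is the one from Python's standard library.

```python
import re
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser reports comments separately, so anchors inside
        # <!-- ... --> never reach this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_by_parser(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately naive: any quoted href value, wherever it appears in the text.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def links_by_regex(html):
    return HREF_RE.findall(html)


# Made-up sample: one normal link, one commented-out link, one fragment.
SAMPLE = """
<a href="http://example.com/a">A</a>
<!-- <a href="http://example.com/old">old</a> -->
<a href="#top">top</a>
"""

if __name__ == "__main__":
    print("parser:", links_by_parser(SAMPLE))
    print("regex: ", links_by_regex(SAMPLE))
```

On this sample the pattern match also returns the commented-out link, which the parser skips. Both return the bare "#top" fragment, so either way a crawler still has to resolve and filter what it gets back before following anything.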

_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
