[Nutch-general] Re: Buggy fetchlist' urls

Andrzej Bialecki Mon, 13 Mar 2006 23:50:07 -0800

Florent Gluck wrote:

> Some urls are totally bogus.  I didn't investigate what could be causing
> this yet, but it looks like it could be a parsing issue.  Some urls
> contain some javascript code and others contain some html tags.

This is a side-effect of our primitive parse-js, which doesn't really
parse anything, just uses some heuristic to extract possible URLs.
Unfortunately, often as not the strings it extracts don't have anything
to do with URLs.
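
To illustrate the kind of heuristic described above, and one way it could be
tightened, here is a minimal sketch in Python. It is not the actual parse-js
code; the names STRING_LITERAL, SUSPICIOUS, looks_like_url and extract_urls
are hypothetical and exist only for this example. It assumes that candidate
URLs are taken from quoted string literals in the script, and that a
candidate containing whitespace, angle brackets, braces, parentheses, quotes
or semicolons is more likely an HTML fragment or a piece of code than a URL.

```python
import re
from urllib.parse import urlparse

# Quoted string literals in JavaScript source (single or double quotes).
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*?)\1""")

# Characters that are rare in real URLs but common in HTML fragments and
# JavaScript code. This set is an assumption chosen for illustration.
SUSPICIOUS = re.compile(r"""[\s<>{}()"';\\]""")


def looks_like_url(candidate):
    """Return True if candidate is plausibly an absolute http(s) URL."""
    if not candidate or SUSPICIOUS.search(candidate):
        return False
    try:
        parts = urlparse(candidate)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. a broken "[" host.
        return False
    # Require an http(s) scheme and a host that contains at least one dot.
    return parts.scheme in ("http", "https") and "." in parts.netloc


def extract_urls(js_source):
    """Collect string literals from js_source that pass looks_like_url."""
    return [
        match.group(2)
        for match in STRING_LITERAL.finditer(js_source)
        if looks_like_url(match.group(2))
    ]
```

The filter is deliberately strict: it would also drop the occasional valid
URL that contains parentheses, and it ignores relative links entirely, so
it trades some recall for fewer bogus fetchlist entries.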


If you have suggestions on how to improve it, I'm all ears.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



