I think doing this sort of thing works out very well for niche search engines. Analyzing the contents of the page takes up some time, but it's just milliseconds per page. If you contrast this with actually fetching a page that you don't want (several seconds * num pages), you can see that the time savings are very much
in your favor.

I'm not sure you'd do it in a URLFilter, since I don't think that gives you easy access to the page contents. You could do it in an HtmlParseFilter: copy the parse-html plugin, find the bit of code where the Outlinks array is set, and then filter that array as you see fit.
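Roughly, the filter ends up looking something like the sketch below. This is written against the Nutch 0.8-style HtmlParseFilter interface; the class locations, the filter() signature, and the ParseData constructor shift between Nutch versions, so treat those as assumptions and check your own source tree. The class name and keyword list are made up for illustration.

// Rough sketch against the 0.8-style API -- verify signatures in your version.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class TopicOutlinkFilter implements HtmlParseFilter {

  // Made-up keyword list for a hypothetical niche crawl.
  private static final String[] KEYWORDS = { "lucene", "nutch", "crawler" };

  private Configuration conf;

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Cheap containment check on the extracted page text -- milliseconds
    // per page, versus seconds to fetch pages we don't want.
    String text = parse.getText().toLowerCase();
    boolean onTopic = false;
    for (int i = 0; i < KEYWORDS.length; i++) {
      if (text.indexOf(KEYWORDS[i]) >= 0) {
        onTopic = true;
        break;
      }
    }

    // If the page is off-topic, drop its outlinks so they never get fetched.
    Outlink[] outlinks = onTopic ? parse.getData().getOutlinks() : new Outlink[0];

    ParseData old = parse.getData();
    ParseData filtered = new ParseData(old.getStatus(), old.getTitle(),
        outlinks, old.getContentMeta(), old.getParseMeta());
    return new ParseImpl(parse.getText(), filtered);
  }

  // Required by the plugin framework in 0.8+ (Configurable).
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

Then register it like any other parse plugin: a plugin.xml declaring the org.apache.nutch.parse.HtmlParseFilter extension point, plus an entry in plugin.includes.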

One thing to be careful about is using regular expressions in Java to analyze the page contents. I've had lots of problems with java.util.regex hanging: it happens with perfectly legal regexes, and only on certain pages.
It's not as big a problem for me since most of my regex stuff is during the
indexing phase, and it's easy to re-index. If it happens during the fetch, it's a bigger
pain, since you have to recover from an aborted fetch. So you might want to
do lots of small crawls, instead of big full crawls.
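If you do go the regex route, one defensive trick is to wrap the page text in a CharSequence that enforces a deadline, so a pathological match gives up instead of hanging the job. This is plain Java, nothing Nutch-specific, and the class and method names below are just made up for illustration:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical guard against regexes that hang on pathological pages:
 * the regex engine reads its input through charAt(), so checking a
 * deadline there lets us abort a runaway backtracking match.
 */
public class TimedRegex {

  private static class MatchTimeout extends RuntimeException {
  }

  private static class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadline;   // absolute time, milliseconds

    DeadlineCharSequence(CharSequence inner, long deadline) {
      this.inner = inner;
      this.deadline = deadline;
    }

    public char charAt(int index) {
      if (System.currentTimeMillis() > deadline) {
        throw new MatchTimeout();
      }
      return inner.charAt(index);
    }

    public int length() {
      return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
      return new DeadlineCharSequence(inner.subSequence(start, end), deadline);
    }
  }

  /** Returns true if pattern is found within timeoutMillis, false otherwise. */
  public static boolean findWithTimeout(Pattern pattern, CharSequence text,
                                        long timeoutMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    Matcher m = pattern.matcher(new DeadlineCharSequence(text, deadline));
    try {
      return m.find();
    } catch (MatchTimeout e) {
      return false;   // treat a timed-out match as "no match" and move on
    }
  }
}

That way a page that would otherwise spin for minutes just counts as a non-match, and the fetch or indexing job keeps moving.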

Howie


I think this could be done with a plug-in like a URL filter, but it seems like it would hurt the performance of the crawling process, so I'd like to hear your opinions. Is it possible or meaningful to crawl based not just on links but on page contents or terms?



