I think doing this sort of thing works out very well for niche search engines. Analyzing the contents of the page takes up some time, but it's just milliseconds per page. If you contrast this with actually fetching a page that you don't want (several seconds * num pages), you can see that the time savings are very much
in your favor.

I'm not sure you'd do it in a URLFilter, since I don't think that gives you easy access to the page contents. You could do it in an HtmlParseFilter: copy the parse-html plugin, find the bit of code where the Outlinks array is set, and then filter that array as you see fit.
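Roughly, the filter ends up looking something like the sketch below. This is written against the Nutch 0.8-style HtmlParseFilter interface; the class locations, the filter() signature, and the ParseData constructor shift between Nutch versions, so treat those as assumptions and check your own source tree. The class name and keyword list are made up for illustration.

// Rough sketch against the 0.8-style API -- verify signatures in your version.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class TopicOutlinkFilter implements HtmlParseFilter {

  // Made-up keyword list for a hypothetical niche crawl.
  private static final String[] KEYWORDS = { "lucene", "nutch", "crawler" };

  private Configuration conf;

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Cheap containment check on the extracted page text -- milliseconds
    // per page, versus seconds to fetch pages we don't want.
    String text = parse.getText().toLowerCase();
    boolean onTopic = false;
    for (int i = 0; i < KEYWORDS.length; i++) {
      if (text.indexOf(KEYWORDS[i]) >= 0) {
        onTopic = true;
        break;
      }
    }

    // If the page is off-topic, drop its outlinks so they never get fetched.
    Outlink[] outlinks = onTopic ? parse.getData().getOutlinks() : new Outlink[0];

    ParseData old = parse.getData();
    ParseData filtered = new ParseData(old.getStatus(), old.getTitle(),
        outlinks, old.getContentMeta(), old.getParseMeta());
    return new ParseImpl(parse.getText(), filtered);
  }

  // Required by the plugin framework in 0.8+ (Configurable).
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

Then register it like any other parse plugin: a plugin.xml declaring the org.apache.nutch.parse.HtmlParseFilter extension point, plus an entry in plugin.includes.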

One thing to be careful about is using regular expressions in Java to analyze the page contents. I've had lots of problems with java.util.regex hanging: it happens with perfectly legal regexes, and only on certain pages.
It's not as big a problem for me since most of my regex stuff is during the
indexing phase, and it's easy to re-index. If it happens during the fetch, it's a bigger
pain, since you have to recover from an aborted fetch. So you might want to
do lots of small crawls, instead of big full crawls.
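If you do go the regex route, one defensive trick is to wrap the page text in a CharSequence that enforces a deadline, so a pathological match gives up instead of hanging the job. This is plain Java, nothing Nutch-specific, and the class and method names below are just made up for illustration:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical guard against regexes that hang on pathological pages:
 * the regex engine reads its input through charAt(), so checking a
 * deadline there lets us abort a runaway backtracking match.
 */
public class TimedRegex {

  private static class MatchTimeout extends RuntimeException {
  }

  private static class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadline;   // absolute time, milliseconds

    DeadlineCharSequence(CharSequence inner, long deadline) {
      this.inner = inner;
      this.deadline = deadline;
    }

    public char charAt(int index) {
      if (System.currentTimeMillis() > deadline) {
        throw new MatchTimeout();
      }
      return inner.charAt(index);
    }

    public int length() {
      return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
      return new DeadlineCharSequence(inner.subSequence(start, end), deadline);
    }
  }

  /** Returns true if pattern is found within timeoutMillis, false otherwise. */
  public static boolean findWithTimeout(Pattern pattern, CharSequence text,
                                        long timeoutMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    Matcher m = pattern.matcher(new DeadlineCharSequence(text, deadline));
    try {
      return m.find();
    } catch (MatchTimeout e) {
      return false;   // treat a timed-out match as "no match" and move on
    }
  }
}

That way a page that would otherwise spin for minutes just counts as a non-match, and the fetch or indexing job keeps moving.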

Howie


I think this could be done with a plug-in like a URL filter, but it seems like it would hurt the performance of the crawling process, so I'd like to hear your opinions. Is it possible or meaningful to crawl based not just on links but on page contents or terms?



