I think doing this sort of thing works out very well for niche search engines.
Analyzing the contents of the page takes some time, but it's just milliseconds
per page. If you contrast that with actually fetching pages that you don't want
(several seconds * number of pages), you can see that the time savings are very
much in your favor.
I'm not sure you'd want a URLFilter for this, since I don't think that gives you
easy access to the page contents. You could do it in an HtmlParseFilter instead:
copy the parse-html plugin, look for the bit of code where the Outlinks array is
set, then filter that Outlinks array as you see fit.
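Something like this, just as a sketch (it's not the actual parse-html code; the
HtmlParseFilter method signature changes between Nutch versions, and "coffee"
is only a placeholder for whatever makes a page on-topic for your niche):

import java.util.ArrayList;
import java.util.List;

import org.apache.nutch.parse.Outlink;

public class OutlinkTermFilter {

  // Keep all outlinks when the page text looks on-topic; otherwise keep
  // only the links whose URL itself contains the topic term.
  public static Outlink[] filterOutlinks(Outlink[] outlinks, String pageText) {
    if (outlinks == null || pageText == null) {
      return outlinks;
    }
    if (pageText.toLowerCase().contains("coffee")) {
      return outlinks;  // page looks relevant, follow everything
    }
    List<Outlink> kept = new ArrayList<Outlink>();
    for (Outlink link : outlinks) {
      if (link.getToUrl().toLowerCase().contains("coffee")) {
        kept.add(link);
      }
    }
    return kept.toArray(new Outlink[kept.size()]);
  }
}

You'd call that right where the plugin builds the Outlinks array, before the
links get stored, passing in whatever page text you extracted.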
One thing to be careful about is using regular expressions in Java to analyze
the page contents. I've had lots of problems with java.util.regex hanging. I get
this with perfectly legal regexes, and only on certain pages. It's not as big a
problem for me, since most of my regex work happens during the indexing phase
and it's easy to re-index. If it happens during the fetch it's a bigger pain,
since you have to recover from an aborted fetch. So you might want to do lots of
small crawls instead of big full crawls.
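If you do end up running regexes at fetch/parse time, one thing you can try is
running each match in a worker thread and giving up after a timeout, so a
runaway match doesn't hang the whole fetch. Again just a sketch, not Nutch code
(the class name and timeout are made up):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

public class BoundedMatcher {

  private static final ExecutorService POOL = Executors.newCachedThreadPool();

  // Returns false if the match doesn't finish within timeoutMillis.
  public static boolean find(final Pattern pattern, final String text,
                             long timeoutMillis) {
    Future<Boolean> result = POOL.submit(new Callable<Boolean>() {
      public Boolean call() {
        return Boolean.valueOf(pattern.matcher(text).find());
      }
    });
    try {
      return result.get(timeoutMillis, TimeUnit.MILLISECONDS).booleanValue();
    } catch (Exception e) {
      // timed out (or failed): stop waiting and treat it as no match
      result.cancel(true);
      return false;
    }
  }
}

One caveat: cancel(true) only stops you from waiting. A match that's
backtracking over a plain String never checks for interruption, so the worker
thread itself can keep spinning; the usual trick for actually killing it is to
wrap the text in a CharSequence whose charAt() checks Thread.interrupted().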
Howie
I think this can be done with a plug-in like a URL filter, but it seems like it
would hurt the performance of the crawling process, so I'd like to hear your
opinions. Is it possible, or meaningful, to crawl not just by links but by
contents or terms?