Hi Kelvin:
I tried to implement controlled depth crawling based
on your Nutch-84 work and the discussion we had before.
1. In the DepthFLFilter class, I made a small modification:
"
public synchronized int filter(FetchListScope.Input input) {
  // Spend one unit of the parent's depth budget, then reject the
  // URL once that budget has gone negative.
  input.parent.decrementDepth();
  return input.parent.depth >= 0 ? ALLOW : REJECT;
}
"
2. In the ScheduledURL class, I added one member variable and one
member function:
"
// Remaining crawl depth budget for this URL.
public int depth;

public void decrementDepth() {
  depth--;
}
"
3. Then we need an initial depth for each domain. For the initial
testing, I can set a default value of 5 for all the sites in
seeds.txt, and a value of 1 for each outlink.
That way, a fairly deep vertical crawl is done within the on-site
domains while the homepage of each outlink is still visible.
Furthermore, should we define a depth value for each URL in
seeds.txt?
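If we do want a per-URL depth, one simple format I can imagine for
seeds.txt is a URL optionally followed by a whitespace-separated
depth, falling back to the default of 5 when it is missing. The
sketch below is only to show the idea; the class name, the file
format and the constants are made up and are not part of NUTCH-84:
"
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for a seeds.txt where each line is
// "<url> [depth]"; a missing depth falls back to the default of 5.
public class SeedDepthReader {

  static final int DEFAULT_SEED_DEPTH = 5;  // default depth for seed sites
  static final int OUTLINK_DEPTH = 1;       // depth given to off-site outlinks

  public static Map<String, Integer> readSeeds(String path) throws IOException {
    Map<String, Integer> seedDepths = new LinkedHashMap<String, Integer>();
    BufferedReader in = new BufferedReader(new FileReader(path));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() == 0 || line.startsWith("#")) {
          continue;  // skip blank lines and comments
        }
        String[] parts = line.split("\\s+");
        int depth = parts.length > 1
            ? Integer.parseInt(parts[1])
            : DEFAULT_SEED_DEPTH;
        seedDepths.put(parts[0], Integer.valueOf(depth));
      }
    } finally {
      in.close();
    }
    return seedDepths;
  }

  public static void main(String[] args) throws IOException {
    // A seeds.txt containing
    //   http://www.example.com/   8
    //   http://www.example.org/
    // would give example.com a depth of 8 and example.org the default 5.
    Map<String, Integer> seeds = readSeeds(args[0]);
    for (Map.Entry<String, Integer> e : seeds.entrySet()) {
      System.out.println(e.getKey() + " -> depth " + e.getValue());
    }
  }
}
"
Seeds listed in the file would keep their configured depth, while
every outlink discovered during the crawl would get OUTLINK_DEPTH
(1) as described above.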
Am I on the right track?
Thanks,
Michael Ji