Hey Michael, I don't think that would work, because every link on a single page
would decrement its parent's depth, so a page with many links would exhaust its
depth after the first few of them.
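To make the problem concrete, here is a minimal sketch. Page is a hypothetical stand-in for ScheduledURL and allow() mirrors the modified filter quoted below; DecrementDemo and countAllowed are names made up for this illustration.

```java
// Illustration only: Page stands in for ScheduledURL, allow() for the
// modified DepthFLFilter.filter() from the quoted message.
class DecrementDemo {
    static class Page {
        int depth;
        Page(int depth) { this.depth = depth; }
        void decrementDepth() { depth--; }
    }

    // Called once per outlink, always with the same parent page.
    static boolean allow(Page parent) {
        parent.decrementDepth();
        return parent.depth >= 0;
    }

    // Counts how many outlinks of a single page pass the filter.
    static int countAllowed(int initialDepth, int outlinks) {
        Page parent = new Page(initialDepth);
        int allowed = 0;
        for (int i = 0; i < outlinks; i++) {
            if (allow(parent)) {
                allowed++;
            }
        }
        return allowed;
    }
}
```

With an initial depth of 5 and 20 outlinks, countAllowed(5, 20) returns 5: only the first five links on the page get through, no matter how deep the page itself is.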
Instead, I would stick with the DepthFLFilter I provided, and change
ScheduledURL's constructor to:
  public ScheduledURL(ScheduledURL parent, URL url) {
    this.id = assignId();
    this.seedIndex = parent.seedIndex;
    this.parentId = parent.id;
    this.depth = parent.depth + 1;
    this.url = url;
  }
Then in beans.xml, declare DepthFLFilter as a bean, and set the "max" property
to 5.
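Roughly, the filter then only has to read the depth and never change it. The sketch below is a simplified stand-in and not the actual DepthFLFilter: the ALLOW/REJECT values, the seed constructor (depth 0), and a filter() that takes the parent directly instead of a FetchListScope.Input are all assumptions made to keep it self-contained.

```java
import java.net.URL;

// Simplified stand-ins; the real classes have more fields and a
// different filter() signature.
class DepthSketch {
    static final int ALLOW = 1;   // assumed constant values
    static final int REJECT = 0;

    static class ScheduledURL {
        final int depth;
        final URL url;

        // Assumption: seed URLs start at depth 0.
        ScheduledURL(URL url) {
            this.depth = 0;
            this.url = url;
        }

        // A child is always one level deeper than its parent.
        ScheduledURL(ScheduledURL parent, URL url) {
            this.depth = parent.depth + 1;
            this.url = url;
        }
    }

    static class DepthFLFilter {
        private int max = 5;

        // Setter matching the "max" property set in beans.xml.
        public void setMax(int max) { this.max = max; }

        // The candidate link would sit at parent.depth + 1.
        // Nothing is mutated, so every link on a page gets the same answer.
        public int filter(ScheduledURL parent) {
            return parent.depth + 1 <= max ? ALLOW : REJECT;
        }
    }
}
```

Because the depth is fixed when the ScheduledURL is constructed, the filter no longer needs to modify any shared state.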
You can even get finer-grained control by writing an FLFilter that lets you
specify a maxDepth per host; if a host is not declared, the default depth is
used. Something like:
<bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
  <property name="defaultMax"><value>20</value></property>
  <property name="hosts">
    <map>
      <entry key="www.nutch.org">
        <value>7</value>
      </entry>
      <entry key="www.apache.org">
        <value>2</value>
      </entry>
    </map>
  </property>
</bean>
(formatting is probably going to end up warped).
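The lookup behind such a filter could be sketched like this. ExtendedDepthFLFilter does not exist yet, so everything here is hypothetical: the ALLOW/REJECT values, the setters (named after the properties in the XML above), and a filter() that takes the host and the parent's depth directly. How the container converts the map values to integers is not addressed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-host depth limit with a default fallback.
class ExtendedDepthSketch {
    static final int ALLOW = 1;   // assumed constant values
    static final int REJECT = 0;

    static class ExtendedDepthFLFilter {
        private int defaultMax = 20;
        private Map<String, Integer> hosts = new HashMap<>();

        public void setDefaultMax(int defaultMax) {
            this.defaultMax = defaultMax;
        }

        public void setHosts(Map<String, Integer> hosts) {
            this.hosts = new HashMap<>(hosts);
        }

        // Per-host limit if the host is declared, otherwise the default.
        int maxFor(String host) {
            return hosts.getOrDefault(host, defaultMax);
        }

        // The candidate link would sit at parentDepth + 1.
        public int filter(String host, int parentDepth) {
            return parentDepth + 1 <= maxFor(host) ? ALLOW : REJECT;
        }
    }
}
```

With the values from the XML above, links on www.apache.org stop being accepted beyond depth 2, while a host that is not listed falls back to 20.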
See what I mean?
k
On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>
> Hi Kelvin:
>
> I tried to implement controlled-depth crawling based on your Nutch-84
> and the discussion we had before.
>
> 1. In the DepthFLFilter class,
>
> I made a small modification:
>
>   public synchronized int filter(FetchListScope.Input input) {
>     input.parent.decrementDepth();
>     return input.parent.depth >= 0 ? ALLOW : REJECT;
>   }
>
> 2. In the ScheduledURL class,
>
> I added one member variable and one member function:
>
>   public int depth;
>
>   public void decrementDepth() {
>     depth--;
>   }
>
> 3. Then
>
> we need an initial depth for each domain. For initial testing,
> I can set a default value of 5 for all the sites in seeds.txt, and
> for each outlink the value will be 1.
>
> That way, a fairly deep crawl is done within each seed's own domain,
> while the home pages of outlinked sites are still reachable.
>
> Furthermore, should we define a depth value for each URL in
> seeds.txt?
>
> Am I on the right track?
>
> Thanks,
>
> Michael Ji