Hey Michael, I don't think that would work, because every link on a single page
would decrement its parent's depth, so a page with, say, ten outlinks would use
up ten levels of depth instead of one.

Instead, I would stick to the DepthFLFilter I provided, and change
ScheduledURL's ctor to

public ScheduledURL(ScheduledURL parent, URL url) {
    this.id = assignId();
    this.seedIndex = parent.seedIndex;
    this.parentId = parent.id;
    this.depth = parent.depth + 1; // a child is one level deeper than its parent
    this.url = url;
}
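
Seeds would then need to start with an explicit depth. Just as a sketch (the
exact seed-side ctor in your tree may differ, so treat the signature as a
guess):

// hypothetical seed-side ctor: seeds start at depth 0, so outlinks from a
// seed page get depth 1, their outlinks depth 2, and so on
public ScheduledURL(int seedIndex, URL url) {
    this.id = assignId();
    this.seedIndex = seedIndex;
    this.parentId = -1; // seeds have no parent
    this.depth = 0;
    this.url = url;
}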

Then in beans.xml, declare DepthFLFilter as a bean, and set the "max" property 
to 5.
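
For example (I'm guessing the package here, so adjust to wherever
DepthFLFilter actually lives):

<bean class="org.supermind.crawl.scope.DepthFLFilter">
  <property name="max"><value>5</value></property>
</bean>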

You can even get more fine-grained control by making a FLFilter that lets you
specify a maxDepth per host; if a host is not declared, the default depth is
used. Something like

<bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
  <property name="defaultMax"><value>20</value></property>
  <property name="hosts">
    <map>
      <entry>
        <key>www.nutch.org</key>
        <value>7</value>
      </entry>
      <entry>
        <key>www.apache.org</key>
        <value>2</value>
      </entry>
    </map>
  </property>
</bean>
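
The filter itself could then look something like the rough sketch below. I'm
assuming FLFilter is an interface exposing ALLOW/REJECT, and that Input carries
the outlink URL (I'm calling it input.url here), so rename as needed:

package org.supermind.crawl.scope;

import java.util.Map;

public class ExtendedDepthFLFilter implements FLFilter {
  private int defaultMax;
  private Map hosts; // host -> max depth, injected from beans.xml

  public void setDefaultMax(int defaultMax) { this.defaultMax = defaultMax; }

  public void setHosts(Map hosts) { this.hosts = hosts; }

  public synchronized int filter(FetchListScope.Input input) {
    // look up a per-host max, falling back to the default
    int max = defaultMax;
    Object hostMax = hosts == null ? null : hosts.get(input.url.getHost());
    if (hostMax != null) {
      max = Integer.parseInt(hostMax.toString());
    }
    return input.parent.depth < max ? ALLOW : REJECT;
  }
}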

(formatting is probably going to end up warped).

See what I mean?

k

On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>
> Hi Kelvin:
>
> I tried to implement controlled depth crawling based on your Nutch-
> 84 and the discussion we had before.
>
> 1. In the DepthFLFilter class, I made a small modification:
>
> public synchronized int filter(FetchListScope.Input input) {
>     input.parent.decrementDepth();
>     return input.parent.depth >= 0 ? ALLOW : REJECT;
> }
>
> 2. In the ScheduledURL class, I added one member variable and one member
> function:
>
> public int depth;
>
> public void decrementDepth() {
>     depth--;
> }
>
> 3. Then we need an initial depth for each domain. For initial testing,
> I can set a default value of 5 for all the sites in seeds.txt, and for
> each outlink the value will be 1.
>
> That way, a fairly vertical crawl is done on the on-site domain
> while outlink homepages are still visible.
>
> Furthermore, should we define a depth value for each URL in
> seeds.txt?
>
> Am I on the right track?
>
> Thanks,
>
> Michael Ji

