Hi Kelvin:
I see your point and agree with you.
Then I guess the filter would be applied in
FetcherThread.java, with something like
"
if (fetchListScope.isInScope(flScopeIn)
    && depthFLFilter.filter(flScopeIn) == ALLOW) ....
"
Am I right?
I am on a business trip this week, so it's hard to
squeeze in time for testing and development, but I will
keep you updated.
Thanks,
Michael
--- Kelvin Tan <[EMAIL PROTECTED]> wrote:
> Hey Michael, I don't think that would work, because
> every link on a single page would decrement its
> parent's depth: a page with ten outlinks would burn
> through ten levels of depth instead of one.
>
> Instead, I would stick to the DepthFLFilter I
> provided, and change ScheduledURL's ctor to
> public ScheduledURL(ScheduledURL parent, URL url) {
>     this.id = assignId();
>     this.seedIndex = parent.seedIndex;
>     this.parentId = parent.id;
>     this.depth = parent.depth + 1;
>     this.url = url;
> }
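>
> With depth set once in the ctor, the filter never
> needs to mutate anything. A minimal sketch, assuming
> the FetchListScope.Input type and the ALLOW/REJECT
> constants from your snippet below, with "max" as the
> bean property:
>
> public int filter(FetchListScope.Input input) {
>     // an outlink of this parent sits at depth
>     // parent.depth + 1, so cut off once the parent
>     // reaches max
>     return input.parent.depth < max ? ALLOW : REJECT;
> }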
>
> Then in beans.xml, declare DepthFLFilter as a bean,
> and set the "max" property to 5.
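>
> Something along these lines, assuming DepthFLFilter
> sits in the same package as the ExtendedDepthFLFilter
> below:
>
> <bean class="org.supermind.crawl.scope.DepthFLFilter">
>   <property name="max"><value>5</value></property>
> </bean>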
>
> You can even have more fine-grained control by
> making an FLFilter that lets you specify a host and
> maxDepth; if a host is not declared, the default
> depth is used. Something like
>
> <bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>   <property name="defaultMax"><value>20</value></property>
>   <property name="hosts">
>     <map>
>       <entry>
>         <key>www.nutch.org</key>
>         <value>7</value>
>       </entry>
>       <entry>
>         <key>www.apache.org</key>
>         <value>2</value>
>       </entry>
>     </map>
>   </property>
> </bean>
>
> (formatting is probably going to end up warped).
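>
> Roughly, such a filter could look like the sketch
> below. This is only a sketch: the FLFilter interface
> name and the input.url field are assumptions on my
> part, and hosts is a plain java.util.Map injected by
> Spring.
>
> public class ExtendedDepthFLFilter implements FLFilter {
>     private int defaultMax;
>     private Map hosts; // host -> max depth, values from beans.xml
>
>     public void setDefaultMax(int defaultMax) {
>         this.defaultMax = defaultMax;
>     }
>
>     public void setHosts(Map hosts) {
>         this.hosts = hosts;
>     }
>
>     public int filter(FetchListScope.Input input) {
>         // fall back to defaultMax when the host isn't declared
>         Object max = hosts.get(input.url.getHost());
>         int limit = (max == null)
>                 ? defaultMax
>                 : Integer.parseInt(max.toString());
>         return input.parent.depth < limit ? ALLOW : REJECT;
>     }
> }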
>
> See what I mean?
>
> k
>
> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
> >
> > Hi Kelvin:
> >
> > I tried to implement controlled-depth crawling
> > based on your NUTCH-84 and the discussion we had
> > before.
> >
> > 1. In the DepthFLFilter class, I made a small
> > modification:
> > "
> > public synchronized int filter(FetchListScope.Input input) {
> >     input.parent.decrementDepth();
> >     return input.parent.depth >= 0 ? ALLOW : REJECT;
> > }
> > "
> >
> > 2. In the ScheduledURL class, I added one member
> > variable and one member function:
> > "
> > public int depth;
> >
> > public void decrementDepth() {
> >     depth--;
> > }
> > "
> >
> > 3. Then we need an initial depth for each domain.
> > For initial testing, I can set a default value of
> > 5 for every site in seeds.txt, and each outlink
> > will get a value of 1.
> >
> > That way we get fairly deep vertical crawling
> > within the on-site domain, while an outlink's
> > homepage is still visible.
> >
> > Furthermore, should we define a depth value for
> > each URL in seeds.txt?
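> >
> > Something like the following per line, say (purely
> > a hypothetical format):
> >
> > http://www.nutch.org/ 7
> > http://www.apache.org/ 2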
> >
> > Am I on the right track?
> >
> > Thanks,
> >
> > Michael Ji