Hi Kelvin:

I see your point and agree with you.

Then I guess the filter would be applied in
FetcherThread.java, with lines like:
"
if (fetchListScope.isInScope(flScopeIn)
    && depthFLFilter.filter(flScopeIn) == ALLOW) ....
"

Am I right?

I am on a business trip this week, so it is hard to
squeeze in time for testing and development. But I will
keep you updated.

Thanks,

Michael


--- Kelvin Tan <[EMAIL PROTECTED]> wrote:

> Hey Michael, I don't think that would work, because
> every link on a single page would decrement its
> parent's depth.
> 
> Instead, I would stick with the DepthFLFilter I
> provided, and change ScheduledURL's ctor to
> 
> public ScheduledURL(ScheduledURL parent, URL url) {
>     this.id = assignId();
>     this.seedIndex = parent.seedIndex; // inherit the originating seed
>     this.parentId = parent.id;
>     this.depth = parent.depth + 1;     // one hop deeper than the parent
>     this.url = url;
>   }
> 
> Then in beans.xml, declare DepthFLFilter as a bean,
> and set the "max" property to 5.
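> 
> For example (my guess at the package; assuming
> DepthFLFilter sits in org.supermind.crawl.scope like
> the filter below):
> 
> <bean class="org.supermind.crawl.scope.DepthFLFilter">
>   <property name="max"><value>5</value></property>
> </bean>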
> 
> You can get even more fine-grained control by making
> a FLFilter that lets you specify a maxDepth per host;
> if a host is not declared, the default depth is used.
> Something like
> 
> <bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>   <property name="defaultMax"><value>20</value></property>
>   <property name="hosts">
>     <map>
>       <entry>
>         <key>www.nutch.org</key>
>         <value>7</value>
>       </entry>
>       <entry>
>         <key>www.apache.org</key>
>         <value>2</value>
>       </entry>
>     </map>
>   </property>
> </bean>
> 
> (formatting is probably going to end up warped).
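> 
> Here is a rough sketch of such a filter (illustrative
> only; I'm assuming Input exposes the candidate URL and
> the parent's depth, and that the FLFilter interface
> defines the ALLOW/REJECT constants that DepthFLFilter
> uses):
> 
> package org.supermind.crawl.scope;
> 
> import java.util.HashMap;
> import java.util.Map;
> 
> public class ExtendedDepthFLFilter implements FLFilter {
>   private int defaultMax;
>   private Map hosts = new HashMap(); // host -> max depth
> 
>   // Setters populated by Spring from beans.xml.
>   public void setDefaultMax(int defaultMax) {
>     this.defaultMax = defaultMax;
>   }
> 
>   public void setHosts(Map hosts) {
>     this.hosts = hosts;
>   }
> 
>   public synchronized int filter(FetchListScope.Input input) {
>     // Use the per-host max if one is declared, else the default.
>     // Map values arrive as Strings from the XML <map> entries.
>     Object max = hosts.get(input.url.getHost());
>     int allowed = (max == null)
>         ? defaultMax : Integer.parseInt(max.toString());
>     return input.parent.depth < allowed ? ALLOW : REJECT;
>   }
> }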
> 
> See what I mean?
> 
> k
> 
> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji
> wrote:
> >
> > Hi Kelvin:
> >
> > I tried to implement controlled depth crawling based
> > on your Nutch-84 and the discussion we had before.
> >
> > 1. In the DepthFLFilter class,
> >
> > I made a small modification:
> > "
> > public synchronized int filter(FetchListScope.Input input) {
> >   input.parent.decrementDepth();
> >   return input.parent.depth >= 0 ? ALLOW : REJECT;
> > }
> > "
> >
> > 2. In the ScheduledURL class,
> >
> > I added one member variable and one member function:
> > "
> > public int depth;
> >
> > public void decrementDepth() {
> >   depth--;
> > }
> > "
> >
> > 3. Then
> >
> > we need an initial depth for each domain. For initial
> > testing, I can set a default value of 5 for all the
> > sites in seeds.txt, and for each outlink the value
> > will be 1.
> >
> > That way, a fairly deep vertical crawl is done for the
> > on-site domain while outlink homepages are still
> > visited.
> >
> > Furthermore, should we define a depth value for each
> > URL in seeds.txt?
> >
> > Am I on the right track?
> >
> > Thanks,
> >
> > Michael Ji
> >
> >