Michael, you don't need to modify FetcherThread at all.
 
Declare DepthFLFilter in beans.xml within the fetchlist scope filter list:
 
<property name="filters">
  <list>
    <bean class="org.supermind.crawl.scope.NutchUrlFLFilter"/>
    <bean class="org.foo.DepthFLFilter">
      <property name="max"><value>20</value></property>
    </bean>
  </list>
</property>

That's all you need to do.
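
In case it helps, the filter itself boils down to something roughly
like this. Sketch only, not the exact source: the Filter interface
name and the setter are assumed from the bean declaration above and
from the filter() signature quoted below.

public class DepthFLFilter implements FetchListScope.Filter {
  private int max;

  // invoked by the bean container for <property name="max">
  public void setMax(int max) { this.max = max; }

  // allow a URL only while its parent's depth is within the limit;
  // ALLOW/REJECT are assumed to come from the Filter interface
  public int filter(FetchListScope.Input input) {
    return input.parent.depth <= max ? ALLOW : REJECT;
  }
}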
 
k

On Mon, 29 Aug 2005 17:18:09 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> I see your idea and agree with you.
>
> Then, I guess the filter would be applied in FetcherThread.java,
> with lines like
> "
> if (fetchListScope.isInScope(flScopeIn) &&
>     depthFLFilter.filter(flScopeIn) == ALLOW) ...
> "
>
> Am I right?
>
> I am on a business trip this week, so it's hard to squeeze in time
> for testing and development. But I will keep you updated.
>
> thanks,
>
> Michael
>
>
> --- Kelvin Tan <[EMAIL PROTECTED]> wrote:
>
>> Hey Michael, I don't think that would work, because every link on
>> a single page would be decrementing the same parent's depth: a
>> page with five outlinks would decrement its depth five times.
>>
>> Instead, I would stick to the DepthFLFilter I provided, and
>> change ScheduledURL's ctor to
>>
>> public ScheduledURL(ScheduledURL parent, URL url) {
>>   this.id = assignId();
>>   this.seedIndex = parent.seedIndex;
>>   this.parentId = parent.id;
>>   this.depth = parent.depth + 1;
>>   this.url = url;
>> }
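>>
>> For example, if a seed is created with depth 0, its outlinks get
>> depth 1, theirs get depth 2, and so on (url1 and url2 here are
>> just placeholder URL objects):
>>
>> ScheduledURL page = new ScheduledURL(seed, url1); // depth 1
>> ScheduledURL link = new ScheduledURL(page, url2); // depth 2
>>
>> Each depth is fixed at construction, so filtering one link never
>> mutates state shared with its siblings.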
>>
>> Then in beans.xml, declare DepthFLFilter as a bean, and set the
>> "max" property to 5.
>>
>> You can even have more fine-grained control by making a FLFilter
>> that lets you specify a maxDepth per host; if a host is not
>> declared, the default depth is used. Something like
>>
>> <bean class="org.supermind.crawl.scope.ExtendedDepthFLFilter">
>>   <property name="defaultMax"><value>20</value></property>
>>   <property name="hosts">
>>     <map>
>>       <entry><key>www.nutch.org</key><value>7</value></entry>
>>       <entry><key>www.apache.org</key><value>2</value></entry>
>>     </map>
>>   </property>
>> </bean>
>>
>> (formatting is probably going to end up warped).
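>>
>> As a rough sketch (not actual source: the Filter interface, the
>> ALLOW/REJECT constants, and the input.url field are all assumed
>> to match what DepthFLFilter already uses):
>>
>> import java.util.Map;
>>
>> public class ExtendedDepthFLFilter implements FetchListScope.Filter {
>>   private int defaultMax;
>>   private Map hosts; // host name -> Integer max depth
>>
>>   public void setDefaultMax(int defaultMax) { this.defaultMax = defaultMax; }
>>   public void setHosts(Map hosts) { this.hosts = hosts; }
>>
>>   public int filter(FetchListScope.Input input) {
>>     // fall back to defaultMax when the host isn't declared
>>     Integer perHost = (Integer) hosts.get(input.url.getHost());
>>     int max = (perHost != null) ? perHost.intValue() : defaultMax;
>>     return input.parent.depth <= max ? ALLOW : REJECT;
>>   }
>> }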
>>
>> See what I mean?
>>
>> k
>>
>> On Sun, 28 Aug 2005 19:37:16 -0700 (PDT), Michael Ji wrote:
>>
>>>
>>> Hi Kelvin:
>>>
>>> I tried to implement controlled depth crawling based on your
>>> Nutch-84 and the discussion we had before.
>>>
>>> 1. In the DepthFLFilter class, I made a small modification:
>>>
>>> public synchronized int filter(FetchListScope.Input input) {
>>>   input.parent.decrementDepth();
>>>   return input.parent.depth >= 0 ? ALLOW : REJECT;
>>> }
>>>
>>> 2. In the ScheduledURL class, I added one member variable and one
>>> member function:
>>>
>>> public int depth;
>>>
>>> public void decrementDepth() {
>>>   depth--;
>>> }
>>>
>>> 3. Then we need an initial depth for each domain. For the initial
>>> testing, I can set a default value of 5 for all the sites in
>>> seeds.txt, and for each outlink the value will be 1.
>>>
>>> That way, a fairly vertical crawl is done for the on-site domain
>>> while outlink homepages are still visible.
>>>
>>> Furthermore, should we define a depth value for each URL in
>>> seeds.txt?
>>>
>>> Am I on the right track?
>>>
>>> Thanks,
>>>
>>> Michael Ji
>>>

