Just use a depth of 10 or so. If there are no more pages left to crawl,
one depth-level more or less does no harm. For normal websites anything
in the range of 5 to 10 for the depth should, imho, be reasonable.

topN: This lets you work on only the highest-ranked URLs not yet
fetched. It acts as a maximum-pages limit for each run (depth level).
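For illustration only (the seed list, output directory and numbers below
are placeholders, not taken from your setup), a one-shot crawl using both
options could look roughly like:

  bin/nutch crawl urls -dir crawl-subdomain -depth 10 -topN 1000

Here urls is the seed file (or directory of seed files, depending on the
Nutch version) containing something like http://sub.MY.DOMAIN.NAME/, and
crawl-urlfilter.txt restricts fetching to that host. With -depth 10 and
-topN 1000 you fetch at most 10 x 1000 pages in total.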


Regards,
 Stefan

Matthew Holt wrote:
> Ok thanks.. as far as crawling the entire subdomain.. what exact command
> would I use?
> 
> Because depth says how many pages deep to go.. is there any way to hit
> every single page, without specifying depth? Or should I just say
> depth=10? Also, topN is no longer used, correct?
> 
> Stefan Neufeind wrote:
> 
>> Matthew Holt wrote:
>>  
>>
>>> Question,
>>>   I'm trying to index a subdomain of my intranet. How do I make it
>>> index the entire subdomain, but not index any pages off of the
>>> subdomain? Thanks!
>>>   
>>
>> Have a look at crawl-urlfilter.txt in the conf/ directory.
>>
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>
>> # skip everything else
>> -.
>>
>>
>> Regards,
>> Stefan
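
To narrow the accept-rule quoted above to one single subdomain (so that
nothing else under MY.DOMAIN.NAME gets fetched), the filter could instead
look like this, with "sub" only standing in for your actual host name:

# accept only this one host
+^http://sub\.MY\.DOMAIN\.NAME/

# skip everything else
-.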
