Perfect, thank you very much.

Rub
misc wrote:
>
> Hello-
>
> From the wiki FAQ:
>
>> Is it possible to fetch only pages from some specific domains?
>
> Please have a look at PrefixURLFilter. Adding some regular expressions
> to urlfilter.regex.file might also work, but a list with thousands of
> regular expressions would slow your system down excessively.
>
> Alternatively, you can set db.ignore.external.links to "true" and
> inject seeds from the domains you wish to crawl (these seeds must
> link, directly or indirectly, to all the pages you wish to crawl).
> The crawl will then stay within those domains and never follow
> external links. Unfortunately there is no way to record the external
> links encountered for future processing, although a very small patch
> to the generator code can log these links to hadoop.log.
>
> I use the second method (sketches of both approaches, plus the
> logging patch, follow below the quoted thread).
>
> see you
>
> -Jim
>
>
> ----- Original Message -----
> From: "rubenll" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Friday, November 02, 2007 10:17 AM
> Subject: restrict indexing only to a domain list with no using
> crawl-urlfilter
>
>
>> Hello,
>>
>> When crawling intranet-style, it is easy to restrict the crawl to a
>> list of domains, i.e. to fetch N levels deep but only within those
>> domains (no external links).
>>
>> With whole-web crawling, is there any way to restrict indexing to a
>> list of domains without using crawl-urlfilter? Maintaining that file
>> makes no sense to me (it is hard work).
>>
>> Regards,
>> rub
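
For reference, a sketch of Jim's first option. PrefixURLFilter reads a
plain list of URL prefixes, one per line (the file name is set by the
urlfilter.prefix.file property, and the urlfilter-prefix plugin must be
listed in plugin.includes); example.com and example.org below are
placeholders for your real domains:

    http://www.example.com/
    http://www.example.org/

The regex-based equivalent goes in the file pointed to by
urlfilter.regex.file: accept the domains you want, then reject
everything else with a final "-." rule (rules are applied in order,
first match wins):

    +^http://([a-z0-9-]+\.)*example\.com/
    +^http://([a-z0-9-]+\.)*example\.org/
    -.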
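
The second option, the one Jim uses, is a single property in
conf/nutch-site.xml:

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>

With that set, inject only seed URLs from the domains you care about;
outlinks pointing to any other host are simply dropped.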
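
The "very small patch" for recording the dropped external links is not
shown in the thread. Hypothetically, wherever your Nutch version
discards an outlink because db.ignore.external.links is set, a one-line
log call is enough to get them into hadoop.log; the variable names
below are made up for illustration and are not actual Nutch code:

    // inside the outlink-processing loop, just before an external
    // link is discarded because db.ignore.external.links is true:
    if (ignoreExternalLinks && !toHost.equals(fromHost)) {
      LOG.info("ignoring external outlink: " + toUrl); // ends up in hadoop.log
      continue;
    }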
