Hello,
From the wiki FAQ:
Is it possible to fetch only pages from some specific domains?
Please have a look at PrefixURLFilter. Adding some regular expressions to
the urlfilter.regex.file might also work, but a list of thousands of
regular expressions would slow your system down excessively.
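For a handful of domains the regex approach is manageable. Entries in
regex-urlfilter.txt would look something like this (example.com and
example.org are placeholders; the +/- prefix syntax is the standard Nutch
regex filter format, and the final "-." rejects everything else):

  # accept anything under these two domains, any subdomain included
  +^http://([a-z0-9]*\.)*example\.com/
  +^http://([a-z0-9]*\.)*example\.org/
  # reject everything else
  -.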
Alternatively, you can set db.ignore.external.links to "true" and inject
seeds from the domains you wish to crawl (these seeds must link, directly
or indirectly, to every page you want fetched). The crawl will then stay
within those domains and never follow links out of them.
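Concretely, you would put something like this in conf/nutch-site.xml (the
property name is the one mentioned above; the paths below are just example
paths):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

and then inject your seed list as usual, e.g.:

  bin/nutch inject crawl/crawldb urls/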
Unfortunately there is no built-in way to record the external links you
encounter for future processing, although a very small patch to the
generator code would let you log these links to hadoop.log.
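As a rough idea of what such a patch could look like (a hypothetical
sketch, not the actual Nutch source; toHost, fromHost and toUrl are made-up
names), at the point where a link is dropped for being external you would
add a log call before discarding it:

  // Hypothetical sketch: where the code detects that a link's host
  // differs from the host of the page it came from, log the URL so
  // it ends up in hadoop.log, then drop the link as before.
  if (!toHost.equalsIgnoreCase(fromHost)) {
    LOG.info("Ignoring external link: " + toUrl);
    return;
  }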
I use the second method.
see you
-Jim
----- Original Message -----
From: "rubenll" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, November 02, 2007 10:17 AM
Subject: restrict indexing only to a domain list with no using crawl-urlfilter
Hello, when crawling intranet-style it is easy to restrict the crawl to a
list of domains. I mean, searching only N levels deep, but only within
those domains (not following external links).
With whole-web crawling, is there any way to restrict indexing to a list
of domains, without following external links and without using
crawl-urlfilter? Maintaining that file makes no sense to me (it is a lot
of work).
Regards
rub