Perfect, thank you very much.

Rub
misc wrote:
>
> Hello-
>
> From the wiki FAQ:
>
>> Is it possible to fetch only pages from some specific domains?
>
> Please have a look at PrefixURLFilter. Adding some regular expressions
> to urlfilter.regex.file might also work, but a list with thousands of
> regular expressions would slow your system down excessively.
>
> Alternatively, you can set db.ignore.external.links to "true" and
> inject seeds from the domains you wish to crawl (these seeds must
> link, directly or indirectly, to all the pages you wish to crawl).
> The crawl will then stay within those domains and never follow
> external links. Unfortunately there is no way to record the external
> links encountered for future processing, although a very small patch
> to the generator code can log these links to hadoop.log.
>
> I use the second method (sketches of both approaches, plus the
> logging patch, follow below the quoted thread).
>
> see you
>
> -Jim
>
>
> ----- Original Message -----
> From: "rubenll" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Friday, November 02, 2007 10:17 AM
> Subject: restrict indexing only to a domain list with no using
> crawl-urlfilter
>
>
>> Hello,
>>
>> When crawling intranet-style, it is easy to restrict the crawl to a
>> list of domains, i.e. to fetch N levels deep but only within those
>> domains (no external links).
>>
>> With whole-web crawling, is there any way to restrict indexing to a
>> list of domains without using crawl-urlfilter? Maintaining that file
>> makes no sense to me (it is hard work).
>>
>> Regards,
>> rub
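
For reference, a sketch of Jim's first option. PrefixURLFilter reads a
plain list of URL prefixes, one per line (the file name is set by the
urlfilter.prefix.file property, and the urlfilter-prefix plugin must be
listed in plugin.includes); example.com and example.org below are
placeholders for your real domains:

    http://www.example.com/
    http://www.example.org/

The regex-based equivalent goes in the file pointed to by
urlfilter.regex.file: accept the domains you want, then reject
everything else with a final "-." rule (rules are applied in order,
first match wins):

    +^http://([a-z0-9-]+\.)*example\.com/
    +^http://([a-z0-9-]+\.)*example\.org/
    -.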
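
The second option, the one Jim uses, is a single property in
conf/nutch-site.xml:

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>

With that set, inject only seed URLs from the domains you care about;
outlinks pointing to any other host are simply dropped.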
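
The "very small patch" for recording the dropped external links is not
shown in the thread. Hypothetically, wherever your Nutch version
discards an outlink because db.ignore.external.links is set, a one-line
log call is enough to get them into hadoop.log; the variable names
below are made up for illustration and are not actual Nutch code:

    // inside the outlink-processing loop, just before an external
    // link is discarded because db.ignore.external.links is true:
    if (ignoreExternalLinks && !toHost.equals(fromHost)) {
      LOG.info("ignoring external outlink: " + toUrl); // ends up in hadoop.log
      continue;
    }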
