If I understand you correctly, you are saying that I must open a new thread even when my question is related to the current one?

-----Original Message-----
From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
To: user <user@nutch.apache.org>
Sent: Thu, Dec 1, 2011 3:01 pm
Subject: Re: Fetching just some urls outside domain

Nutch comes packed with quite a few url-filters out of the box. They just need some tuning. Have a look in NUTCH_HOME/conf
Also have a look at the corresponding plugins.

Realistically you should really start a new thread for new questions :0)

I think you're looking for the urlfilter-domain plugin

On Thu, Dec 1, 2011 at 10:48 PM, <alx...@aim.com> wrote:

> Hello,
>
> It is interesting to know how can one put a filter on outlinks? I mean if
> I have a regex, in which file should I put it?
> For example, I want nutch to ignore outlinks ending with .info.
>
> Thanks.
> Alex.
>
> -----Original Message-----
> From: Arkadi.Kosmynin <arkadi.kosmy...@csiro.au>
> To: user <user@nutch.apache.org>
> Sent: Thu, Dec 1, 2011 1:44 pm
> Subject: RE: Fetching just some urls outside domain
>
> Hi Adriana,
>
> You can try Arch for this:
>
> http://www.atnf.csiro.au/computing/software/arch
>
> You can configure it to crawl your web sites plus sets of miscellaneous URLs
> called "bookmarks" in Arch. Arch is a free extension of Nutch. Right now, only
> Arch based on Nutch 1.2 is available for downloading. We are about to release
> Arch based on Nutch 1.4.
>
> Regards,
>
> Arkadi
>
> > -----Original Message-----
> > From: Adriana Farina [mailto:adriana.farin...@gmail.com]
> > Sent: Thursday, 1 December 2011 7:58 PM
> > To: user@nutch.apache.org
> > Subject: Re: Fetching just some urls outside domain
> >
> > Hi!
> >
> > Thank you for your answer. You're right, maybe an example would explain
> > better what I need to do.
> >
> > I have to perform the following task. I have to explore a specific domain
> > (.gov.it) and I have an initial set of seeds, for example www.aaa.it,
> > www.bbb.gov.it, www.ccc.it. I configured nutch so that it doesn't fetch
> > pages outside that domain. However some resources I need to download
> > (documents) are stored on web sites that are not inside the domain I'm
> > interested in.
> > For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where
> > www.somesite.it is not inside "my" domain). Nutch will not fetch that page
> > since I told it to behave that way, but I need to download documents stored
> > on www.somesite.it. So I need nutch to go outside the domain I specified
> > only when it sees the words "albi" or "albo" inside the url, since that
> > words identify the documents I need. How can I do this?
> >
> > I hope I've been clear. :)
> >
> > 2011/11/30 Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> >
> > > Hi Adriana,
> > >
> > > This should be achievable through fine grained URL filters. It is kindof
> > > hard to substantiate on this without you providing some examples of the
> > > type of stuff you're trying to do!
> > >
> > > Lewis
> > >
> > > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina <adriana.farin...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I'm using nutch 1.3 from just a month, so I'm not an expert. I configured
> > > > it so that it doesn't fetch pages outside a specific domain. However now I
> > > > need to let it fetch pages outside the domain I choosed but only for some
> > > > urls (not for all the urls I have to crawl). How can I do this? I have to
> > > > write a new plugin?
> > > >
> > > > Thanks.
> > >
> > > --
> > > *Lewis*

--
*Lewis*
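
For anyone following along, the two filtering questions raised in this thread (ignoring outlinks ending with .info, and leaving the domain only for URLs containing "albi" or "albo") can be sketched roughly as below. This is a standalone Python illustration, not Nutch code. It assumes the "+regex" / "-regex" rule style used in conf/regex-urlfilter.txt, where the first matching rule decides and a URL matching no rule is dropped. The three rules, the gov.it host pattern, and the names parse_rules and accept are all hypothetical and would need tuning against a real crawl.

```python
import re

# Hypothetical rules, written in the "+regex" / "-regex" style of Nutch's
# conf/regex-urlfilter.txt. These are assumptions for illustration only.
RULES = r"""
# skip URLs that end with .info
-\.info$
# accept hosts under gov.it
+^https?://([a-z0-9-]+\.)*gov\.it(/|$)
# accept any URL that mentions albi or albo (loose: also matches e.g. "albino")
+alb[io]
"""


def parse_rules(text):
    """Turn rule lines into a list of (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(url, rules):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.bbb.gov.it/page",
        "http://www.somesite.it/albi/doc.pdf",
        "http://www.somesite.it/news",
        "http://example.org/file.info",
    ):
        print(url, "->", "fetch" if accept(url, rules) else "skip")
```

Rule order matters in this sketch: the .info exclusion comes first so that it overrides the two accept rules that follow it.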