they are not filtered out by default filters.

On Thursday 09 February 2012 15:18:39 Peter Jameson wrote:
> Hi Markus, thanks for your reply! Noob question: how do I ensure .ics
> files are not filtered out from the crawl? I've searched the
> configuration files, but am not sure which parameters to set. Any help is
> greatly appreciated. Thanks!
>
> Sent from my iPad
>
> On Feb 9, 2012, at 4:04 AM, "Markus Jelsma" <[email protected]> wrote:
> > Yes, you can. Just crawl the websites as usual with Nutch and make sure
> > .ics files are not filtered out. There will be attempts to parse the
> > files, but they may fail.
> > In the end, all links are in your crawlDb, and you can then simply
> > extract a list of .ics URLs with the old crawldbscanner tool or the new
> > crawldbreader tool.
> >
> > On Wednesday 08 February 2012 18:04:49 Peter Jameson wrote:
> >> Hi,
> >>
> >> I'm interested in using Nutch to crawl certain websites looking only
> >> for a specific file type; in my case, any URL that ends with a *.ics
> >> extension. I don't need to "parse" the .ics files, I just need to
> >> know all the .ics files that exist. A list of links would be great.
> >>
> >> Can Nutch be configured to do this?
> >>
> >> Thanks!
> >>
> >> Pete
> >> [email protected]
-- Markus Jelsma - CTO - Openindex
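
[Editor's note] The filter change discussed above lives in Nutch's URL filter configuration. A minimal sketch, assuming the stock `conf/regex-urlfilter.txt` (the exact default suffix-exclusion list varies by Nutch version):

```
# conf/regex-urlfilter.txt
# The default install rejects many file suffixes with a rule similar to
# the one below; make sure "ics" is NOT in that list:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# Optionally, explicitly accept calendar files:
+\.ics$

# Accept anything else (the stock catch-all rule):
+.
```

Rules are applied top to bottom and the first match wins, so the `+\.ics$` line must appear before any `-` rule that would otherwise reject those URLs. If the `urlfilter-suffix` plugin is enabled, check `conf/suffix-urlfilter.txt` as well.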


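[Editor's note] The crawlDb extraction Markus mentions can be sketched with the crawldb reader (`bin/nutch readdb`). The paths below are hypothetical, and a tiny hand-made dump file stands in for real Nutch output; the assumption is the plain-text dump format where each record's first line is the URL followed by a tab:

```shell
# In a real crawl you would first dump the crawlDb to text, e.g.:
#   bin/nutch readdb crawl/crawldb -dump crawldb-dump -format normal
# Here we fake a tiny dump so the extraction step can be shown end to end.
mkdir -p crawldb-dump
printf 'http://example.com/cal/events.ics\tVersion: 7\nStatus: 2 (db_fetched)\nhttp://example.com/index.html\tVersion: 7\nStatus: 2 (db_fetched)\n' > crawldb-dump/part-00000

# Record lines start with the URL followed by a tab; keep only .ics URLs.
awk -F'\t' 'NF > 1 {print $1}' crawldb-dump/part-* | grep '\.ics$' > ics-urls.txt
cat ics-urls.txt
```

The `awk` step keeps only lines that contain a tab (the per-record URL lines) and prints the URL field; `grep '\.ics$'` then reduces that to the calendar links Pete is after.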