Re: Finding specific file types only --> *.ics files

Markus Jelsma Thu, 09 Feb 2012 02:04:45 -0800

Yes, you can. Just crawl the websites as usual with Nutch and make sure .ics
files are not filtered out. Nutch will attempt to parse the files and those
attempts may fail, but since you only need the links that does not matter.
Once the crawl is done, all links are in your crawlDb, and you can simply
extract a list of .ics URLs with the old crawldbscanner tool or the new
crawldbreader tool.
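
If you would rather post-process a plain-text dump of the crawlDb yourself,
the filtering step could look roughly like the sketch below. It is only an
illustration, not part of Nutch: it assumes each record in the dump begins
with a line whose first whitespace-separated token is the URL, and the
function name ics_urls is made up for this example.

```python
from urllib.parse import urlsplit


def ics_urls(dump_lines):
    """Return the unique URLs whose path ends in .ics, in first-seen order.

    Assumption: a URL, when present, is the first whitespace-separated
    token on a line and starts with http:// or https://.
    """
    seen = set()
    found = []
    for line in dump_lines:
        parts = line.split()
        if not parts:
            continue  # blank separator line between records
        candidate = parts[0]
        if not candidate.startswith(("http://", "https://")):
            continue  # metadata line (status, fetch time, ...), not a URL
        # Compare against the path only, so query strings are ignored,
        # and lowercase it so .ICS is matched as well.
        if urlsplit(candidate).path.lower().endswith(".ics"):
            if candidate not in seen:
                seen.add(candidate)
                found.append(candidate)
    return found
```

You would feed it the lines of the dump file, for example
ics_urls(open(path_to_dump)) where path_to_dump is wherever you wrote the
dump, and print the result one URL per line.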


On Wednesday 08 February 2012 18:04:49 Peter Jameson wrote:
> Hi,
> 
> I'm interested in using Nutch to crawl certain websites looking for only a
> specific file type, in my case I'm looking for any url that ends with a
> *.ics construct.  I don't need to "parse" the ics files, I just need to
> know all the .ics files that exist.  A list of links would be great.
> 
> Can Nutch be configured to do this?
> 
> Thanks!
> 
> Pete
> [email protected]

-- 
Markus Jelsma - CTO - Openindex
