they are not filtered out by default filters.

On Thursday 09 February 2012 15:18:39 Peter Jameson wrote:
> Hi Markus, thanks for your reply! Noob question: how do I ensure .ics
> files are not filtered out from the crawl? I've searched the
> configuration files, but am not sure which parameters to set. Any help is
> greatly appreciated. Thanks!
>
> Sent from my iPad
>
> On Feb 9, 2012, at 4:04 AM, "Markus Jelsma" <[email protected]> wrote:
> > Yes, you can. Just crawl the websites as usual with Nutch and make sure
> > .ics files are not filtered out. There will be attempts to parse the
> > files, but they may fail.
> > In the end, all links are in your crawlDb, and you can then simply
> > extract a list of .ics URLs with the old crawldbscanner tool or the new
> > crawldbreader tool.
> >
> > On Wednesday 08 February 2012 18:04:49 Peter Jameson wrote:
> >> Hi,
> >>
> >> I'm interested in using Nutch to crawl certain websites looking only
> >> for a specific file type; in my case, any URL that ends with a *.ics
> >> extension. I don't need to "parse" the .ics files, I just need to
> >> know all the .ics files that exist. A list of links would be great.
> >>
> >> Can Nutch be configured to do this?
> >>
> >> Thanks!
> >>
> >> Pete
> >> [email protected]
-- Markus Jelsma - CTO - Openindex
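
[Editor's note] The filter change discussed above lives in Nutch's URL filter configuration. A minimal sketch, assuming the stock `conf/regex-urlfilter.txt` (the exact default suffix-exclusion list varies by Nutch version):

```
# conf/regex-urlfilter.txt
# The default install rejects many file suffixes with a rule similar to
# the one below; make sure "ics" is NOT in that list:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# Optionally, explicitly accept calendar files:
+\.ics$

# Accept anything else (the stock catch-all rule):
+.
```

Rules are applied top to bottom and the first match wins, so the `+\.ics$` line must appear before any `-` rule that would otherwise reject those URLs. If the `urlfilter-suffix` plugin is enabled, check `conf/suffix-urlfilter.txt` as well.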


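[Editor's note] The crawlDb extraction Markus mentions can be sketched with the crawldb reader (`bin/nutch readdb`). The paths below are hypothetical, and a tiny hand-made dump file stands in for real Nutch output; the assumption is the plain-text dump format where each record's first line is the URL followed by a tab:

```shell
# In a real crawl you would first dump the crawlDb to text, e.g.:
#   bin/nutch readdb crawl/crawldb -dump crawldb-dump -format normal
# Here we fake a tiny dump so the extraction step can be shown end to end.
mkdir -p crawldb-dump
printf 'http://example.com/cal/events.ics\tVersion: 7\nStatus: 2 (db_fetched)\nhttp://example.com/index.html\tVersion: 7\nStatus: 2 (db_fetched)\n' > crawldb-dump/part-00000

# Record lines start with the URL followed by a tab; keep only .ics URLs.
awk -F'\t' 'NF > 1 {print $1}' crawldb-dump/part-* | grep '\.ics$' > ics-urls.txt
cat ics-urls.txt
```

The `awk` step keeps only lines that contain a tab (the per-record URL lines) and prints the URL field; `grep '\.ics$'` then reduces that to the calendar links Pete is after.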