Thanks for your input ...

> Ultimately, I want to exhaustively scrape a few hundred thousand sites.
> Right now, I'm trying to get my head round some details. I'm using the
> crawl command for now.
A word of advice: avoid using the crawl command. It's a one-hit wonder. Do the full cycle even if you are crawling just one website.

[Iain>>] I expect I won't use the crawl command. But I'm trying to limit the learning curve right now!

> 1. If I use 'nutch readdb <crawldb> -stats', I get no output. Dump works,
> but I get nothing from stats. What am I doing wrong? Version 0.7.2 works
> fine, but 0.8 does not.

It's a bug and it has been fixed; you can find the fix in SVN. With your version you might have to look at logs/hadoop.log, or change conf/log4j.properties to have the output on standard out.

[Iain>>] Thanks. I found the output in hadoop.log. Building from sources and downloading a particular version is part of my learning curve which is yet to come ... (I hope!)

> 3. If I scrape a site with embedded flash, the flash is not parsed and so
> I get no linked pages. Is this what I should expect? Do I need to do
> something to enable parsing embedded objects and/or flash?

plugin-flash, I think; I am not sure. You need to enable all the plugins that you need. Have a look at nutch-site.xml and nutch-default.xml under conf.

[Iain>>] So I need to add parse-swf to plugin.includes ... I've done this and the plugin appears to be loaded, but it does not pick up the embedded flash object (at least it does not appear to). So swf files referenced inside a (flash) <object> container do not seem to be parsed.

[Iain>>] Thanks.
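
For reference, a minimal sketch of the nutch-site.xml override discussed above. Only the plugin.includes property and the parse-swf plugin name come from this thread. The rest of the value is an assumed baseline plugin list for 0.8, so copy the plugin.includes value from your own conf/nutch-default.xml and add swf to its parse-(...) group rather than relying on the list shown here.

```xml
<!-- conf/nutch-site.xml: overrides the plugin.includes value from nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumed baseline list; the only intended change is "swf" in the parse-(...) group -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|swf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
```

Loading the plugin should let Nutch parse .swf documents that it actually fetches. Whether the HTML parser extracts a .swf URL from an <object> element as an outlink in the first place is a separate question, which would be consistent with the behaviour reported above.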
