Thanks for your input ...

> Ultimately, I want to exhaustively scrape a few hundred thousand sites.
> Right now, I'm trying to get my head round some details.  I'm using the
> crawl command for now.

A word of advice: avoid the crawl command. It's a one-hit wonder. Do
the full cycle, even if you are crawling just one website.
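
A rough sketch of one round of that cycle, assuming the 0.8 command names
(generate, fetch, updatedb) and the crawl/crawldb and crawl/segments layout
from the tutorial. The crawl_round function and the NUTCH variable are
hypothetical helpers for illustration, not part of Nutch:

```shell
#!/bin/sh
# One round of the step-by-step cycle: generate, fetch, updatedb.
# NUTCH can be overridden (e.g. NUTCH=echo) to preview the commands.
NUTCH="${NUTCH:-bin/nutch}"

crawl_round() {
  crawldb="$1"
  segments="$2"

  # Generate a fetch list; this creates a new timestamped segment directory.
  $NUTCH generate "$crawldb" "$segments" || return 1

  # Pick the newest segment (names are timestamps, so they sort by age).
  segment=$(ls -d "$segments"/* 2>/dev/null | tail -1)
  [ -n "$segment" ] || return 1

  # Fetch the pages listed in that segment.
  $NUTCH fetch "$segment" || return 1

  # Fold the fetch results and newly found links back into the crawldb.
  $NUTCH updatedb "$crawldb" "$segment"
}

# Typical use (not run here):
#   bin/nutch inject crawl/crawldb urls
#   crawl_round crawl/crawldb crawl/segments   # repeat once per depth level
```

After the last round you would still run invertlinks and index before
searching; check the usage output of bin/nutch for their arguments.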
[Iain>>] I expect I won't use the crawl command, but I'm trying to limit
the learning curve right now!

>
>
> 1. If I use 'nutch readdb <crawldb> -stats', I get no output.  Dump works,
> but I get nothing from stats.  What am I doing wrong?  Version 0.7.2 works
> fine, but 0.8 does not.

It's a bug, and it's been fixed; you can find the fix in SVN. With your
version you might have to look at logs/hadoop.log, or change
conf/log4j.properties to send the output to standard out.
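
A minimal sketch of that conf/log4j.properties change, assuming log4j 1.x
syntax and assuming readdb logs under its class name,
org.apache.nutch.crawl.CrawlDbReader. The appender name "console" is made
up for illustration; if your file already defines a console appender,
reuse that one instead:

```properties
# Hypothetical appender name "console": writes log events to standard out.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n

# Send the readdb tool's INFO output to that appender (assumed logger name).
log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO, console
```

The output will normally still go to logs/hadoop.log as well, since the
root logger keeps its own appender.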
[Iain>>] Thanks.  I found the output in hadoop.log.  Building from sources
and downloading a particular version is part of my learning curve which is
yet to come ... (I hope!)

>
> 3. If I scrape a site with embedded flash, the flash is not parsed and so
> I get no linked pages.  Is this what I should expect?  Do I need to do
> something to enable parsing embedded objects and/or flash?

I think it's plugin-flash, but I'm not sure. You need to enable all the
plugins that you need. Have a look at nutch-site.xml and nutch-default.xml
under conf.
[Iain>>] So I need to add parse-swf to plugin.includes ...
I've done this and the plugin appears to be loaded, but it does not pick up
the embedded flash object (at least it does not appear to).  So swf files
referenced inside a (flash) <object> container do not seem to be parsed.
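
A sketch of that kind of override in conf/nutch-site.xml. The plugin list
here is an assumed default with swf added to the parse group; copy the real
value of plugin.includes from your own nutch-default.xml and change only
that group:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumed default list with swf added to the parse-(...) group;
         start from the value in your own nutch-default.xml. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|swf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
```

This only makes the parser available for swf content that gets fetched. It
does not by itself make the HTML parser follow references inside an
<object> container.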

[Iain>>] Thanks.
