I’m a Nutch beginner and a bit out of practice with Java.

Ultimately, I want to exhaustively scrape a few hundred thousand sites.
Right now, I’m trying to get my head round some details.  I’m using the
crawl command for now.

1. If I use 'nutch readdb <crawldb> -stats', I get no output.  Dump works,
but I get nothing from stats.  What am I doing wrong?  Version 0.7.2 works
fine, but 0.8 does not.

2. I’m using the db.ignore.external.links option and have had problems
setting the crawl-urlfilter for the hosts.  In the end I’ve put in
+^http://*/ which seems to work.  Is this right?

3. If I scrape a site with embedded Flash, the Flash is not parsed and so I
get no linked pages.  Is this what I should expect?  Do I need to do
something to enable parsing of embedded objects and/or Flash?

4. If I understand the comments in crawl-urlfilter.txt correctly, links
containing a ‘?’ (queries) are not followed.  I most certainly need to
follow these.

5. At least some ASP pages do not use anchors and so on for navigation, but
instead use posts triggered by JavaScript.  Will Nutch handle these?  (I
imagine this also applies to other self-posting pages such as JSP.)

6. Will Nutch handle JavaScript redirects?
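
To show what I mean in question 2, here is a small Java sketch of how I think the pattern behaves.  It assumes the filter uses standard java.util.regex and accepts a URL when the pattern is found anywhere in it; I have not confirmed that this is how Nutch applies it.  The class name, the accepts helper, and the example.com / example.org hosts are made up for illustration.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Hypothetical helper: true if the regex is found anywhere in the URL.
    // Assumes find() semantics rather than a full-string match.
    static boolean accepts(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        // The pattern I ended up with (without the leading '+').
        String mine = "^http://*/";
        // A per-host pattern, with example.com as a placeholder domain.
        String perHost = "^http://([a-z0-9]*\\.)*example\\.com/";

        String[] urls = {
            "http://www.example.com/index.html",
            "http://other.example.org/page?id=1",
            "ftp://www.example.com/file"
        };
        for (String url : urls) {
            System.out.println(url
                + " mine=" + accepts(mine, url)
                + " perHost=" + accepts(perHost, url));
        }
    }
}
```

If that assumption holds, the * applies to the preceding /, so my pattern accepts every http URL rather than acting as a host wildcard.  That may be harmless if db.ignore.external.links is what actually keeps the crawl on the seed hosts, but I would like to know whether it is the intended way.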

Hope someone can help me with some of these!

Iain
