I'm a Nutch beginner and a bit out of practice with Java.
Ultimately, I want to exhaustively scrape a few hundred thousand sites. Right now, I'm trying to get my head round some details. I'm using the crawl command for now.

1. If I use nutch readdb <crawldb> stats, I get no output. Dump works, but I get nothing from stats. What am I doing wrong? Version 0.7.2 works fine, but 0.8 does not.

2. I'm using the db.ignore.external.links option and have had problems setting the crawl-urlfilter for the hosts. In the end I've put in +^http://*/ which seems to work. Is this right?

3. If I scrape a site with embedded Flash, the Flash is not parsed and so I get no linked pages. Is this what I should expect? Do I need to do something to enable parsing of embedded objects and/or Flash?

4. If I understand the comments in crawl-urlfilter.txt correctly, links containing a ? (queries) are not followed. I most certainly need to follow these.

5. At least some ASP pages do not use anchors and so on for navigation, but instead use POSTs triggered by JavaScript. Will Nutch handle these? (I imagine this also applies to other self-posting pages such as JSP.)

6. Will Nutch handle JavaScript redirects?

Hope someone can help me with some of these!

Iain
