On 8/6/06, Iain <[EMAIL PROTECTED]> wrote:
I'm a Nutch beginner and a bit out of practice with Java.
Ultimately, I want to exhaustively scrape a few hundred thousand sites.
Right now, I'm trying to get my head round some details. I'm using the
crawl command for now.
A word of advice: avoid the crawl command. It is a one-hit wonder. Run
the full cycle step by step, even if you are crawling just one website.
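For reference, the full cycle that the crawl command wraps is roughly: inject the seed URLs, then repeat generate, fetch and updatedb for as many rounds as you want, then invertlinks and index. Below is a minimal sketch that only builds the command lines in order, without running anything. The directory names (crawl/crawldb, crawl/segments, urls) and the segment name are placeholders of mine, and the argument forms are as I remember them from the 0.8 tutorial, so check them against the usage message bin/nutch prints for each command.

```python
# Sketch only: builds the bin/nutch command lines for one crawl cycle.
# Directory layout and segment names are illustrative assumptions.

def crawl_cycle_commands(crawl_dir, seed_dir, segments):
    """Return the commands for one full cycle, in the order they should run.

    `segments` lists one segment directory per generate/fetch/updatedb round.
    In a real run each name is created by `generate` (a timestamp), so you
    would read it from the segments directory after that step.
    """
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments_dir = f"{crawl_dir}/segments"

    # Seed the crawl database once.
    cmds = [["bin/nutch", "inject", crawldb, seed_dir]]

    # One generate/fetch/updatedb round per segment.
    for seg in segments:
        cmds.append(["bin/nutch", "generate", crawldb, segments_dir])
        cmds.append(["bin/nutch", "fetch", seg])
        cmds.append(["bin/nutch", "updatedb", crawldb, seg])

    # Build the link database and the index from all fetched segments.
    cmds.append(["bin/nutch", "invertlinks", linkdb] + list(segments))
    cmds.append(
        ["bin/nutch", "index", f"{crawl_dir}/indexes", crawldb, linkdb]
        + list(segments)
    )
    return cmds


if __name__ == "__main__":
    for cmd in crawl_cycle_commands("crawl", "urls", ["crawl/segments/SEGMENT_1"]):
        print(" ".join(cmd))
```

Doing the rounds by hand like this is what lets you re-run generate/fetch/updatedb later to deepen or refresh the crawl, which the one-shot crawl command does not.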
1. If I use 'nutch readdb <crawldb> -stats', I get no
output. -dump works, but I get nothing from -stats. What am I doing wrong?
Version 0.7.2 works fine, but 0.8 does not.
It's a bug and it has been fixed; you can find the fix in SVN. With your
version you might have to look at logs/hadoop.log, or change
conf/log4j.properties to send the output to standard out.
2. I'm using the db.ignore.external.links option and have
had problems setting the crawl-urlfilter for the hosts. In the end I've
put in +^http://*/ which seems to work. Is this right?
Again, you are better off not using this one-hit wonder. Honestly, I
would like to propose that we remove this "command". It is supposed to
help users, but it actually hides all the processes involved in running
a search engine. I see no point in hiding these processes from
operators, who will need to know them no matter what. Then again,
that is just my view.
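On the pattern itself: +^http://*/ works, but probably not for the reason you intended. In a regular expression the * applies to the slash before it, so the rule reads "http:/", any number of extra slashes, then a slash, and that is found in every http URL whatever the host. The host restriction you are seeing would then come from db.ignore.external.links, not from the filter. Here is a small sketch of this, using Python's re module as a stand-in for the regex engine Nutch uses (an assumption, though for patterns this simple the behaviour should be the same) and assuming a rule applies when it is found anywhere in the URL. The example.org rule is hypothetical, modelled on the MY.DOMAIN.NAME line in the stock crawl-urlfilter.txt.

```python
import re

# The rule from the question, without the leading '+' (which only marks it as "accept").
CATCH_ALL = re.compile(r"^http://*/")

# Hypothetical host-restricted rule for example.org and its subdomains.
ONE_HOST = re.compile(r"^http://([a-z0-9]*\.)*example\.org/")


def matches(pattern, url):
    # Assumption: a rule applies when the pattern is found in the URL
    # (search semantics, not a full-string match).
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in [
        "http://example.org/",
        "http://www.example.org/a.html",
        "http://other.net/page",
        "https://example.org/",
    ]:
        print(url, matches(CATCH_ALL, url), matches(ONE_HOST, url))
```

If you do want the filter itself to pin the crawl to particular hosts, one rule of the ONE_HOST shape per host is the usual form.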
3. If I scrape a site with embedded Flash, the Flash is
not parsed and so I get no linked pages. Is this what I should expect? Do
I need to do something to enable parsing embedded objects and/or Flash?
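There is a plugin for Flash, I think (parse-swf, if I have the name
right), but I am not sure. You need to enable every plugin you need:
have a look at the plugin.includes property in conf/nutch-default.xml
and override it in conf/nutch-site.xml.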
4. If I understand comments in the crawl-urlfilter.txt
correctly, links containing a '?' (queries) are not followed. I most
certainly need to follow them.
5. At least some ASP pages do not use anchors and so on
for navigation, but instead use POSTs triggered by JavaScript. Will Nutch
handle these? (I imagine this also applies to other self-posting pages
such as JSP.)
6. Will Nutch handle JavaScript redirects?
As far as I know, yes.
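Coming back to question 4: as far as I remember, the stock crawl-urlfilter.txt contains the rule -[?*!@=], which skips any URL containing one of those characters as a probable query. Rules are tried top to bottom, the first one that matches decides, and a URL that matches nothing is dropped. Commenting that line out (or removing the ? and = from it) should let query links through. Here is a sketch of that first-match logic. The rule lists are cut-down illustrations rather than the real file, and Python's re module stands in for Nutch's regex engine, which is an assumption.

```python
import re


def url_filter(rules, url):
    """Apply '+pattern' / '-pattern' rules in order; the first match decides.

    Returns True to keep the URL, False to drop it. A URL that matches
    no rule is dropped (my understanding of the filter's behaviour).
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# Illustrative rule sets, not the complete stock file.
WITH_QUERY_SKIP = ["-[?*!@=]", "+^http://"]
WITHOUT_QUERY_SKIP = ["+^http://"]

if __name__ == "__main__":
    url = "http://example.org/list.asp?page=2"
    print(url_filter(WITH_QUERY_SKIP, url))     # dropped by the first rule
    print(url_filter(WITHOUT_QUERY_SKIP, url))  # kept by the accept rule
```

Bear in mind that following query links can open up very large or endless URL spaces (calendars, session IDs), so you may want to relax the rule only for the hosts that need it.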
Hope someone can help me with some of these!
Iain