Hello, I have what I assume to be a simple user issue with nutch-0.8-dev. I'm using Nutch to do a single-site crawl on a Fedora Core 4 Linux machine. The site I'm crawling consists of Perl (Catalyst, to be specific) and PHP (an app called Gallery, and an instance of MediaWiki).
The issue I'm having is that Nutch does not seem to crawl the gallery section of the site. There are links from the main site to the gallery, and I've listed the top-level gallery URL in the initial URL list I pass to nutch crawl. Sorry for the length of the message, but I wanted to provide as much information about the problem as I could. Nutch does crawl the wiki and Perl sections of the site.

Crawl command:

    nutch crawl urls -dir ../nutch-index -depth 25 -topN10000

The urls dir contains one file called urls.txt:

    http://www.philadelphiariders.com/
    http://www.philadelphiariders.com/c/dmoz/Top.html
    http://www.philadelphiariders.com/gallery/

The only change I've made to crawl-urlfilter.txt is:

    +^http://www.philadelphiariders.com/

which replaced the example regex rule that was there out of the box.

In the index output, I see a reference to the gallery:

    Indexing [http://www.philadelphiariders.com/gallery/] with analyzer [EMAIL PROTECTED] (null)

But the rest of the gallery is not referenced in the index output. The command

    ./bin/nutch readdb ../nutch-index/crawldb -dump ./dumpdata

has only these two entries referencing the gallery. Does the Status of view_album.php have anything to do with my issue?

    http://www.philadelphiariders.com/gallery/ Version: 4
    Status: 2 (DB_fetched)
    Fetch time: Tue May 16 15:20:15 EDT 2006
    Modified time: Wed Dec 31 19:00:00 EST 1969
    Retries since fetch: 0
    Retry interval: 30.0 days
    Score: 316.0114
    Signature: b7619f18442c6f356f802ba7847dc127

    http://www.philadelphiariders.com/gallery/view_album.php Version: 4
    Status: 3 (DB_gone)
    Fetch time: Sun Apr 16 15:21:12 EDT 2006
    Modified time: Wed Dec 31 19:00:00 EST 1969
    Retries since fetch: 0
    Retry interval: 30.0 days
    Score: 2.0824916
    Signature: null

Links that are not indexed are in the linkdb:

    ./bin/nutch readlinkdb ../nutch-index/linkdb -dump joe

yields:

    http://philadelphiariders.com/gallery/2005-Events Inlinks:
      fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

    http://philadelphiariders.com/gallery/2006-Events Inlinks:
      fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
      fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2006 Events

    http://philadelphiariders.com/gallery/2nd-Sunday-Rides Inlinks:
      fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
      fromUrl: http://www.philadelphiariders.com/c/ anchor: 2nd Sunday Rides
      fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2nd Sunday Rides

    http://philadelphiariders.com/gallery/April-2006 Inlinks:
      fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006

    http://philadelphiariders.com/gallery/Marilyns-Photos Inlinks:
      fromUrl: http://www.philadelphiariders.com/c/ anchor: Marilyn's Photos

    http://philadelphiariders.com/gallery/Rider-Gallery Inlinks:
      fromUrl: http://www.philadelphiariders.com/gallery/ anchor: Rider Gallery
      fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

Also, a lot of the navigation in the Gallery application makes use of GET parameters. To follow links containing these, would I need to tweak crawl-urlfilter.txt to remove the following rule:

    # skip URLs containing certain characters as probable queries, etc.
    [EMAIL PROTECTED]

I don't think this is the whole problem, because the root URL for the gallery has been fetched/indexed. That page contains links that are not queries (i.e. do not contain ?).

Thanks in advance for any help you can offer.

Andy

--
Andrew Libby
[EMAIL PROTECTED]
http://philadelphiariders.com/
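
As a rough illustration of the filter question above, here is a minimal Python sketch of how rules like those in crawl-urlfilter.txt might be evaluated. It is not Nutch's code. It assumes that rules are tried top to bottom, that the first rule whose regex is found in the URL decides the outcome, and that a URL matching no rule is ignored. The `-[?]` rule is a simplified, hypothetical stand-in for the "probable queries" rule (which is garbled in the message), and the `?page=2` URL is made up for the example.

```python
import re

# Hypothetical rule list modelled on crawl-urlfilter.txt:
# a leading "+" accepts, a leading "-" rejects.
RULES = [
    "-[?]",                                  # stand-in for the "probable queries" rule
    "+^http://www.philadelphiariders.com/",  # the rule quoted in the message
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose regex is found in url is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is ignored

if __name__ == "__main__":
    for url in [
        "http://www.philadelphiariders.com/gallery/",                       # seed URL
        "http://philadelphiariders.com/gallery/2005-Events",                # from the linkdb dump
        "http://www.philadelphiariders.com/gallery/view_album.php?page=2",  # made-up query URL
    ]:
        print(accepts(url), url)
```

Under these assumptions the sketch accepts the seed gallery URL and rejects the made-up query URL. It also rejects the linkdb URL, because that URL has no `www.` prefix and so does not match the accept rule; whether the real filter behaves the same way would need to be checked against Nutch itself.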
