Nutch crawl not fetching portions of site

Andrew Libby Tue, 18 Apr 2006 07:31:12 -0700

Hello,

I have what I assume to be a simple user issue with nutch-0.8-dev.  I'm
using Nutch to do a single-site crawl on a Fedora Core 4 Linux machine.
The site I'm crawling consists of Perl (Catalyst, to be specific) and
PHP (an app called Gallery, and an instance of MediaWiki).


The issue I'm having is that Nutch does not seem to crawl the gallery
section of the site.  There are links from the main site to the gallery,
and I've listed the top-level gallery URL in the initial URL list I pass
to nutch crawl.

Sorry for the length of the message, but I wanted to try to provide as
much information about the problem as I could.

Nutch does crawl the wiki and perl sections of the site.

Crawl Command:

nutch crawl urls -dir ../nutch-index -depth 25 -topN10000

The urls dir contains one file called urls.txt:

http://www.philadelphiariders.com/
http://www.philadelphiariders.com/c/dmoz/Top.html
http://www.philadelphiariders.com/gallery/

The only change I've made to crawl-urlfilter.txt is:

+^http://www.philadelphiariders.com/

which replaced the example regex rule that was there out of the box.

In the index output, I see a reference to the gallery:

 Indexing [http://www.philadelphiariders.com/gallery/] with analyzer
[EMAIL PROTECTED] (null)

But the rest of the gallery is not referenced in index output. 

The command

./bin/nutch readdb ../nutch-index/crawldb -dump ./dumpdata

produces a dump with only these two entries referencing the gallery.
Does the status of view_album.php (DB_gone) have anything to do with my
issue?

http://www.philadelphiariders.com/gallery/  Version: 4
Status: 2 (DB_fetched)
Fetch time: Tue May 16 15:20:15 EDT 2006
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 316.0114
Signature: b7619f18442c6f356f802ba7847dc127

http://www.philadelphiariders.com/gallery/view_album.php    Version: 4
Status: 3 (DB_gone)
Fetch time: Sun Apr 16 15:21:12 EDT 2006
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 2.0824916
Signature: null

Links that are not indexed are in the linkdb:

./bin/nutch readlinkdb ../nutch-index/linkdb -dump joe

yields:

http://philadelphiariders.com/gallery/2005-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

http://philadelphiariders.com/gallery/2006-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2006 Events

http://philadelphiariders.com/gallery/2nd-Sunday-Rides  Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: 2nd Sunday Rides
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2nd Sunday Rides

http://philadelphiariders.com/gallery/April-2006    Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006

http://philadelphiariders.com/gallery/Marilyns-Photos   Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: Marilyn's Photos

http://philadelphiariders.com/gallery/Rider-Gallery Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: Rider Gallery
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
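
As a quick sanity check, the accept rule from crawl-urlfilter.txt can be
tried against some of the URLs in the dumps above.  This is only a
minimal sketch in Python, not Nutch itself: it assumes the rule is
applied as a "find anywhere in the URL" regex match, and the helper name
passes_accept_rule is made up for illustration.

```python
import re

# Accept rule quoted in this message, with the leading "+" dropped.
ACCEPT = re.compile(r"^http://www.philadelphiariders.com/")

# URLs copied from the readdb / readlinkdb dumps in this message.
urls = [
    "http://www.philadelphiariders.com/gallery/",
    "http://philadelphiariders.com/gallery/2005-Events",
    "http://philadelphiariders.com/gallery/Rider-Gallery",
]

def passes_accept_rule(url):
    """Hypothetical helper: True when the accept pattern is found in the URL."""
    return ACCEPT.search(url) is not None

for url in urls:
    label = "match   " if passes_accept_rule(url) else "no match"
    print(label, url)
```

With these three inputs only the first URL matches the pattern, since
the other two lack the "www." prefix.  That shows what the regex does on
its own; it says nothing definite about what Nutch did during the crawl.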

Also, a lot of the navigation in the Gallery application makes use of
GET parameters.  To follow links containing these, would I need to tweak
crawl-urlfilter.txt to remove the following line:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

I don't think this is the whole problem, because the root URL
for the gallery has been fetched and indexed.  This page contains
links that are not queries (i.e. do not contain ?).
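
To illustrate the question about the skip rule, here is a rough Python
sketch of ordered filter rules where the first matching rule decides and
an unmatched URL is rejected.  The "-" pattern is assumed to be the
stock rule from crawl-urlfilter.txt, the query URL is a made-up example,
and filter_url is a hypothetical helper, not a Nutch API.

```python
import re

# Rules in file order as (sign, pattern).
# "-" rule: assumed stock skip rule; "+" rule: the one quoted in this message.
RULES = [
    ("-", re.compile(r"[?*!@=]")),
    ("+", re.compile(r"^http://www.philadelphiariders.com/")),
]

def filter_url(url, rules=RULES):
    """Return True if the first matching rule is '+'.

    Returns False if the first matching rule is '-' or if no rule matches.
    """
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False

examples = [
    "http://www.philadelphiariders.com/gallery/",
    # Hypothetical query-style link, for illustration only.
    "http://www.philadelphiariders.com/gallery/view_album.php?page=2",
]

for url in examples:
    print("with skip rule:   ", filter_url(url), url)
    print("without skip rule:", filter_url(url, RULES[1:]), url)
```

Under these assumptions the plain gallery URL is accepted either way,
while the query-style URL is accepted only once the skip rule is
removed from the list.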

Thanks in advance for any help you can offer.

Andy

-- 
Andrew Libby                                  
[EMAIL PROTECTED]
http://philadelphiariders.com/
