[Nutch-general] Re: Nutch crawl not fetching portions of site

Andrew Libby Tue, 18 Apr 2006 11:05:08 -0700

The odd part is that they are in the linkdb, which they would not be if
they were being filtered out, am I right?  The output in the initial
message I sent shows a few of these:


./bin/nutch readlinkdb ../nutch-index/linkdb -dump joe

yields:

http://philadelphiariders.com/gallery/2005-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

http://philadelphiariders.com/gallery/2006-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2006 Events

http://philadelphiariders.com/gallery/2nd-Sunday-Rides  Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: 2nd Sunday Rides
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2nd Sunday Rides

http://philadelphiariders.com/gallery/April-2006    Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006

http://philadelphiariders.com/gallery/Marilyns-Photos   Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: Marilyn's Photos


Is my understanding/assumption accurate?  As you can see, the links above
do not contain the query character '?', but the Gallery application does
use it in page navigation.
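
To make the filter question concrete, here is a minimal sketch of how an ordered list of +/- regex rules could be evaluated against URLs like the ones in the dump. It is not Nutch's own code. It assumes that rules are tried in file order, that the first pattern found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The accept rule is the one quoted in the original message below; the skip rule `[?*!@=]` is assumed to be the stock rule that ships in crawl-urlfilter.txt; the `?x=1` query string is a made-up example.

```python
import re

# Rules in file order: (sign, compiled pattern).
# The "+" rule is copied from the original message; the "-" rule's
# character class is an assumption (stock crawl-urlfilter.txt rule).
RULES = [
    ("-", re.compile(r"[?*!@=]")),
    ("+", re.compile(r"^http://www.philadelphiariders.com/")),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is '+'."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    # Assumed behaviour: a URL that matches no rule is rejected.
    return False


urls = [
    "http://www.philadelphiariders.com/gallery/",
    "http://philadelphiariders.com/gallery/2005-Events",
    # Hypothetical query string, only to exercise the skip rule.
    "http://www.philadelphiariders.com/gallery/view_album.php?x=1",
]

for url in urls:
    print(accepts(url), url)
```

Under these assumptions only the first URL is accepted. The second one is the form that appears in the linkdb dump: its host has no "www." prefix, so the accept rule anchored at http://www. does not match it, which may be worth checking.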

Thanks.

Andy



Dennis Kubes wrote:

> It is possible that the URL filter is preventing the links from being
> crawled, especially if they have characters such as ? or ; in them
> (i.e. like a php session id).  Can you post an example of a link?
>
> Dennis
>
> Andrew Libby wrote:
>
>> Hello,
>>
>> I have what I assume to be a simple user issue with nutch-0.8-dev.  I'm
>> using Nutch to do a single site crawl on a Fedora Core 4 Linux machine.
>> The site I'm crawling consists of Perl (Catalyst to be specific), and
>> PHP (an app called gallery, and an instance of MediaWiki).
>>
>> The issue I'm having is that Nutch does not seem to crawl the gallery
>> section of the site.  There are links from the main site to gallery,
>> and I've listed the top level gallery URL in the initial url list I
>> pass to nutch crawl.
>>
>> Sorry for the length of the message, but I wanted to try to provide as
>> much information about the problem as I could.
>>
>> Nutch does crawl the wiki and perl sections of the site.
>>
>> Crawl Command:
>>
>> nutch crawl urls -dir ../nutch-index -depth 25 -topN10000
>>
>> The urls dir contains one file called urls.txt:
>>
>> http://www.philadelphiariders.com/
>> http://www.philadelphiariders.com/c/dmoz/Top.html
>> http://www.philadelphiariders.com/gallery/
>>
>> The only change I've made to crawl-urlfilter.txt is:
>>
>> +^http://www.philadelphiariders.com/
>>
>> which replaced the example regex rule that was there out of the box.
>>
>> In the index output, I see a reference to the gallery:
>>
>>  Indexing [http://www.philadelphiariders.com/gallery/] with analyzer
>> [EMAIL PROTECTED] (null)
>>
>> But the rest of the gallery is not referenced in index output.
>> The output of the command
>> ./bin/nutch readdb ../nutch-index/crawldb -dump ./dumpdata
>> has only these two entries referencing the gallery.  Does the status of
>> view_album.php have anything to do with my issue?
>>
>> http://www.philadelphiariders.com/gallery/  Version: 4
>> Status: 2 (DB_fetched)
>> Fetch time: Tue May 16 15:20:15 EDT 2006
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 316.0114
>> Signature: b7619f18442c6f356f802ba7847dc127
>>
>> http://www.philadelphiariders.com/gallery/view_album.php    Version: 4
>> Status: 3 (DB_gone)
>> Fetch time: Sun Apr 16 15:21:12 EDT 2006
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 2.0824916
>> Signature: null
>>
>> Links that are not indexed are in the linkdb:
>>
>> ./bin/nutch readlinkdb ../nutch-index/linkdb -dump joe
>>
>> yields:
>>
>> http://philadelphiariders.com/gallery/2005-Events   Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
>>
>> http://philadelphiariders.com/gallery/2006-Events   Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2006 Events
>>
>> http://philadelphiariders.com/gallery/2nd-Sunday-Rides  Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
>>  fromUrl: http://www.philadelphiariders.com/c/ anchor: 2nd Sunday Rides
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2nd Sunday Rides
>>
>> http://philadelphiariders.com/gallery/April-2006    Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006
>>
>> http://philadelphiariders.com/gallery/Marilyns-Photos   Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/c/ anchor: Marilyn's Photos
>>
>> http://philadelphiariders.com/gallery/Rider-Gallery Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor: Rider Gallery
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
>>
>> Also, a lot of the navigation in the Gallery application makes use of
>> GET parameters.  To follow links containing these, would I need to tweak
>> crawl-urlfilter.txt to remove the following line:
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>> I don't think this is the whole problem, because the root url
>> for the gallery has been fetched/indexed.  This page contains
>> links that are not queries (i.e. do not contain ?).
>>
>> Thanks in advance for any help you can offer.
>>
>> Andy
>>
>>   
>
>


-- 
Andrew Libby                                  
[EMAIL PROTECTED]
http://philadelphiariders.com/



