The odd part is that they are in the linkdb, which they would not be if the filter had excluded them, am I right? The output in the initial message I sent shows a few of these:

./bin/nutch readlinkdb ../nutch-index/linkdb -dump joe

yields:

http://philadelphiariders.com/gallery/2005-Events Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

http://philadelphiariders.com/gallery/2006-Events Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2006 Events

http://philadelphiariders.com/gallery/2nd-Sunday-Rides Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: 2nd Sunday Rides
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2nd Sunday Rides

http://philadelphiariders.com/gallery/April-2006 Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006

http://philadelphiariders.com/gallery/Marilyns-Photos Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: Marilyn's Photos

Is my understanding/assumption accurate? As you can see, the links above do not contain the query character '?', but the Gallery application does use it in page navigation.

Thanks.

Andy

Dennis Kubes wrote:
> It is possible that the URL filter is preventing the links from being
> crawled, especially if they have characters such as ? or ; in them
> (i.e. like a php session id). Can you post an example of a link?
>
> Dennis
>
> Andrew Libby wrote:
>
>> Hello,
>>
>> I have what I assume to be a simple user issue with nutch-0.8-dev. I'm using Nutch
>> to do a single site crawl on a Fedora Core 4 Linux machine. The site I'm crawling consists
>> of Perl (Catalyst to be specific), and PHP (an app called gallery, and an instance of Media Wiki).
>>
>> The issue I'm having is that Nutch does not seem to crawl the gallery section of the site.
>> There are links from the main site to gallery, and I've listed the top-level gallery URL
>> in the initial url list I pass to nutch crawl.
>>
>> Sorry for the length of the message, but I wanted to try to provide as much information about
>> the problem as I could.
>>
>> Nutch does crawl the wiki and perl sections of the site.
>>
>> Crawl Command:
>>
>> nutch crawl urls -dir ../nutch-index -depth 25 -topN10000
>>
>> The urls dir contains one file called urls.txt:
>>
>> http://www.philadelphiariders.com/
>> http://www.philadelphiariders.com/c/dmoz/Top.html
>> http://www.philadelphiariders.com/gallery/
>>
>> The only change I've made to crawl-urlfilter.txt is:
>>
>> +^http://www.philadelphiariders.com/
>>
>> which replaced the example regex rule that was there out of the box.
>>
>> In the index output, I see a reference to the gallery:
>>
>> Indexing [http://www.philadelphiariders.com/gallery/] with analyzer
>> [EMAIL PROTECTED] (null)
>>
>> But the rest of the gallery is not referenced in the index output.
>> The command ./bin/nutch readdb ../nutch-index/crawldb -dump ./dumpdata
>> has only these two entries referencing the gallery. Does the
>> status of view_album.php have anything to do with my issue?
>>
>> http://www.philadelphiariders.com/gallery/ Version: 4
>> Status: 2 (DB_fetched)
>> Fetch time: Tue May 16 15:20:15 EDT 2006
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 316.0114
>> Signature: b7619f18442c6f356f802ba7847dc127
>>
>> http://www.philadelphiariders.com/gallery/view_album.php Version: 4
>> Status: 3 (DB_gone)
>> Fetch time: Sun Apr 16 15:21:12 EDT 2006
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 2.0824916
>> Signature: null
>>
>> Links that are not indexed are in the linkdb:
>>
>> ./bin/nutch readlinkdb ../nutch-index/linkdb -dump joe
>>
>> yields:
>>
>> http://philadelphiariders.com/gallery/2005-Events Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
>>
>> http://philadelphiariders.com/gallery/2006-Events Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2006 Events
>>
>> http://philadelphiariders.com/gallery/2nd-Sunday-Rides Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
>>  fromUrl: http://www.philadelphiariders.com/c/ anchor: 2nd Sunday Rides
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2nd Sunday Rides
>>
>> http://philadelphiariders.com/gallery/April-2006 Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006
>>
>> http://philadelphiariders.com/gallery/Marilyns-Photos Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/c/ anchor: Marilyn's Photos
>>
>> http://philadelphiariders.com/gallery/Rider-Gallery Inlinks:
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor: Rider Gallery
>>  fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
>>
>> Also, a lot of the navigation in the Gallery application makes use of
>> GET parameters.
>> To follow links containing these, would I need to tweak
>> crawl-urlfilter.txt to remove the following line:
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> [EMAIL PROTECTED]
>>
>> I don't think this is the whole problem, because the root url
>> for the gallery has been fetched/indexed. This page contains
>> links that are not queries (i.e. they do not contain ?).
>>
>> Thanks in advance for any help you can offer.
>>
>> Andy
>>
>>
>
>

--
Andrew Libby
[EMAIL PROTECTED]
http://philadelphiariders.com/

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
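
One detail in the dump above may be worth checking: every linkdb key starts with http://philadelphiariders.com/ (no www), while the seed URLs and the single filter rule quoted in the message use http://www.philadelphiariders.com/. The sketch below is only a simplified Python model of an ordered accept/reject rule list, not Nutch's own filter code. The accepts helper and its first-match, reject-by-default behaviour are assumptions made for illustration, and the other default rules in crawl-urlfilter.txt are left out.

```python
import re

# Simplified model of an ordered accept/reject rule list in the style of
# crawl-urlfilter.txt. Assumptions (not taken from the thread): rules are
# tried in order, the first pattern found in the URL decides, and a URL
# that matches no rule is rejected.
RULES = [
    ("+", r"^http://www.philadelphiariders.com/"),  # the one rule quoted in the message
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an accept ('+') rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched


if __name__ == "__main__":
    urls = [
        "http://www.philadelphiariders.com/gallery/",         # seed URL
        "http://philadelphiariders.com/gallery/2005-Events",  # linkdb key
        "http://philadelphiariders.com/gallery/April-2006",   # linkdb key
    ]
    for url in urls:
        print("accept" if accepts(url) else "reject", url)
```

If the real filter behaves like this model, the gallery links without www would be rejected before fetching, which would be consistent with them showing up in the linkdb but not in the index. Whether that is what happens here would need to be confirmed against the actual crawl.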
