Yes, it's permanently redirected.
You can also check the segment status of this URL. Here is an example:

reinh...@thord:>bin/nutch readseg -get crawl/segments/20091028122455 "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"

It will show you whether the page was parsed, along with the extracted outlinks
and any other data related to this URL that is stored in the segment.

regards

caezar wrote:
> Thanks, that was really helpful. I've moved forward but still have not found the
> solution.
> So the status of the initial URL
> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:
> Status: 5 (db_redir_perm)
> Metadata: _pst_: moved(12), lastModified=0: http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>
> So that answers the question of why the initial page was not indexed: it
> was redirected.
> Now checking the status of the redirect target:
> Status: 2 (db_fetched)
>
> So it was successfully fetched. But, according to the indexing log, it still was
> not sent to the indexer!
>
> reinhard schwab wrote:
>
>> What is the db status of this URL in your crawl db?
>> If it is STATUS_DB_NOTMODIFIED,
>> then that may be the reason.
>> You can check it by reading your crawl db with:
>> reinh...@thord:>bin/nutch readdb <crawldb> -url <url>
>>
>> A URL gets this status if it is recrawled and the signature does not change.
>> The signature is an MD5 hash of the content.
>>
>> Another reason may be that you have some indexing filters.
>> I don't believe that's the reason here.
>>
>> regards
>>
>> kevin chen wrote:
>>
>>> I have had a similar experience.
>>>
>>> Reinhard Schwab suggested a possible fix. See his mail in this group from
>>> Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT)
>>>
>>> I haven't had a chance to try it out yet.
>>>
>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I've got a strange problem: Nutch indexes far fewer URLs than it
>>>> fetches. For example, this URL:
>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>> I assume that it was fetched successfully because it is mentioned in the
>>>> fetch logs only once:
>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>
>>>> But it was not sent to the indexer during the indexing phase (I'm using a
>>>> custom NutchIndexWriter that logs every page for which its write method is
>>>> executed). What could be the reason? Is there a way to browse the
>>>> crawldb to ensure that the page was really fetched? What else could I check?
>>>>
>>>> Thanks
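
PS: the crawldb checks discussed in this thread can be scripted. Below is a rough sketch, not something tested against a real crawl. It assumes the `readdb ... -url` output contains a `Status:` line and a single-line `Metadata: _pst_: moved(...)` entry in the shape quoted above. The function names `crawldb_status` and `redirect_target` are made up for illustration, and `CRAWLDB` and `URL` are placeholders.

```shell
#!/bin/sh
# Sketch only: parses readdb output in the shape quoted in this thread.
# Both helpers read the output of `bin/nutch readdb <crawldb> -url <url>` on stdin.

# Print the symbolic status name, e.g. "db_redir_perm" or "db_fetched".
crawldb_status() {
    sed -n 's/^Status: [0-9][0-9]* (\([a-z_]*\)).*/\1/p' | head -n 1
}

# Print the redirect target recorded in the _pst_ metadata, if there is one.
# Assumes the Metadata entry sits on one line, as in the status quoted above.
redirect_target() {
    sed -n 's/^Metadata: .*_pst_: moved([0-9]*), lastModified=[0-9]*: \(.*\)$/\1/p' | head -n 1
}

# Usage (not run here; CRAWLDB and URL are placeholders):
#   out=$(bin/nutch readdb "$CRAWLDB" -url "$URL")
#   printf '%s\n' "$out" | crawldb_status
#   printf '%s\n' "$out" | redirect_target
```

If the first helper prints db_redir_perm, the second gives the URL to look up next with the same readdb command, which is the manual step caezar did above.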