Hi,

What exactly happens when a page is redirected? Is the original URL updated in the db as well, once the new URL is properly fetched and updatedb is run? I am looking into Fetcher.java, but a warm human word would help. Also, I noticed that Fetcher does not have any unit test class... :-(
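To make the question concrete, here is a rough, self-contained sketch of the kind of redirect-following loop I have in mind. This is not Nutch's actual Fetcher code (the names and structure here are my own guesses), but it shows the part I do understand: following Location headers up to http.redirect.max. What it deliberately does not show is the part I'm asking about: nothing in the fetch itself writes the original URL's new status back to the db, so presumably that has to happen in updatedb.

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectSketch {

    // Mirrors my http.redirect.max setting (hard-coded here for the sketch).
    static final int MAX_REDIRECTS = 3;

    // Follows Location headers by hand and returns the URL that finally
    // answered with a non-redirect status code.
    static URL fetch(URL url) throws Exception {
        URL current = url;
        for (int i = 0; i <= MAX_REDIRECTS; i++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // handle redirects ourselves
            int code = conn.getResponseCode();
            if (code < 300 || code >= 400) {
                return current; // this URL was actually fetched
            }
            String location = conn.getHeaderField("Location");
            if (location == null) {
                return current; // broken redirect; give up here
            }
            current = new URL(current, location); // resolves relative Locations too
        }
        throw new Exception("more than " + MAX_REDIRECTS + " redirects for " + url);
    }

    public static void main(String[] args) throws Exception {
        // example.com is only a placeholder
        System.out.println(fetch(new URL("http://example.com/old-page")));
    }
}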
Regards,
Lukas

On 5/19/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
Hi Andrzej,

I am sorry for the late reply. I haven't had a chance to prepare these dump files for you yet, but I made one interesting observation which could shed some light on the problem. I turned on HTTP and fetcher verbose logging, and it seems that all three of these URLs redirect the fetcher to the same page. I have a lot of unfetched URL links in the database, but many of them do not point to any real document (the original document is gone), and the server redirects to a default page (the home page, a "can't find this page" page, etc.). Do you think this information could help us now? (There is a small probe in the P.S. below that makes this easy to reproduce.)

Anyway, I'll try to prepare those dump files for you (I don't have much experience with the segread command so far). However, I tried the newest SVN nutch-0.8 today with the same result. My current setting for redirects:

<name>http.redirect.max</name>
<value>3</value>

Regards,
Lukas

On 5/17/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Lukas Vlcek wrote:
> > Hi Andrzej,
> >
> > nutch-site.xml says:
> > <name>db.default.fetch.interval</name>
> > <value>15</value>
> >
> > I tried readdb -dump.
> > I am not an expert in dump output, but to me it seems that the db is not
> > updated. I have two dump outputs (pre and post), and diffing them I found
> > the following differences:
> > 1) Some score values were changed.
> > 2) Only one fetch time for one document has been changed, but that is
> > not any of the three fetched pages...
> >
> > I also checked these three pages and they are still unfetched.
> >
> > Wow, that seems very strange...
> > Any idea?
>
> OK, this could indicate some bugs in either Generate or CrawlDbReducer
> (both of which have recently been changed in a couple of places). Could
> you please do the following:
>
> * prepare a fragment of the crawldb dump with the data about these three
>   pages
> * generate, so that you get these three pages in the fetchlist (easy to
>   check with segread)
> * fetch
> * prepare a fragment of the segment dump (segread -dump) with the data
>   about these pages
> * run updatedb
> * prepare a fragment of the crawldb dump after updating
>
> And then package this data nicely and send it to me. Thanks!
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
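P.S. To make the redirect observation above reproducible, here is a rough stand-alone probe (the URLs below are placeholders; substitute the three problem pages). It prints the raw status code and Location header for each URL, so it is easy to see whether they all collapse onto the same default page:

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; put the three problem pages here.
        String[] urls = {
            "http://example.com/gone-1.html",
            "http://example.com/gone-2.html",
            "http://example.com/gone-3.html",
        };
        for (String u : urls) {
            HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
            conn.setInstanceFollowRedirects(false); // we want the raw Location header
            System.out.println(u + " -> " + conn.getResponseCode()
                + " " + conn.getHeaderField("Location"));
        }
    }
}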
