Hi,

What exactly happens when a page is redirected? Is the original URL updated in the db as well, once the new URL is properly fetched and updatedb is run? I am looking into Fetcher.java, but a warm human word would help. Also, I noticed that Fetcher does not have any unit test class... :-(
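To make the question concrete, here is a rough, self-contained sketch of the kind of redirect-following loop I have in mind. This is not Nutch's actual Fetcher code (the names and structure here are my own guesses), but it shows the part I do understand: following Location headers up to http.redirect.max. What it deliberately does not show is the part I'm asking about: nothing in the fetch itself writes the original URL's new status back to the db, so presumably that has to happen in updatedb.

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectSketch {

    // Mirrors my http.redirect.max setting (hard-coded here for the sketch).
    static final int MAX_REDIRECTS = 3;

    // Follows Location headers by hand and returns the URL that finally
    // answered with a non-redirect status code.
    static URL fetch(URL url) throws Exception {
        URL current = url;
        for (int i = 0; i <= MAX_REDIRECTS; i++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // handle redirects ourselves
            int code = conn.getResponseCode();
            if (code < 300 || code >= 400) {
                return current; // this URL was actually fetched
            }
            String location = conn.getHeaderField("Location");
            if (location == null) {
                return current; // broken redirect; give up here
            }
            current = new URL(current, location); // resolves relative Locations too
        }
        throw new Exception("more than " + MAX_REDIRECTS + " redirects for " + url);
    }

    public static void main(String[] args) throws Exception {
        // example.com is only a placeholder
        System.out.println(fetch(new URL("http://example.com/old-page")));
    }
}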
Regards,
Lukas

On 5/19/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
Hi Andrzej,

I am sorry for the late reply. I haven't had a chance to prepare these dump files for you yet, but I made one interesting observation which could shed some light on the problem. I turned on HTTP and fetcher verbose logging, and it seems that all three of these URLs redirect the fetcher to the same page. I have a lot of unfetched URL links in the database, but many of them do not point to any real document (the original document is gone), and the server redirects to a default page (the home page, a "can't find this page" page, etc.). Do you think this information could help us now? (There is a small probe in the P.S. below that makes this easy to reproduce.)

Anyway, I'll try to prepare those dump files for you (I don't have much experience with the segread command so far). However, I tried the newest SVN nutch-0.8 today with the same result. My current setting for redirects:

<name>http.redirect.max</name>
<value>3</value>

Regards,
Lukas

On 5/17/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Lukas Vlcek wrote:
> > Hi Andrzej,
> >
> > nutch-site.xml says:
> > <name>db.default.fetch.interval</name>
> > <value>15</value>
> >
> > I tried readdb -dump.
> > I am not an expert in dump output, but to me it seems that the db is not
> > updated. I have two dump outputs (pre and post), and diffing them I found
> > the following differences:
> > 1) Some score values were changed.
> > 2) Only one fetch time for one document has been changed, but that is
> > not any of the three fetched pages...
> >
> > I also checked these three pages and they are still unfetched.
> >
> > Wow, that seems very strange...
> > Any idea?
>
> OK, this could indicate some bugs in either Generate or CrawlDbReducer
> (both of which have recently been changed in a couple of places). Could
> you please do the following:
>
> * prepare a fragment of the crawldb dump with the data about these three
>   pages
> * generate, so that you get these three pages in the fetchlist (easy to
>   check with segread)
> * fetch
> * prepare a fragment of the segment dump (segread -dump) with the data
>   about these pages
> * run updatedb
> * prepare a fragment of the crawldb dump after updating
>
> And then package this data nicely and send it to me. Thanks!
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
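P.S. To make the redirect observation above reproducible, here is a rough stand-alone probe (the URLs below are placeholders; substitute the three problem pages). It prints the raw status code and Location header for each URL, so it is easy to see whether they all collapse onto the same default page:

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; put the three problem pages here.
        String[] urls = {
            "http://example.com/gone-1.html",
            "http://example.com/gone-2.html",
            "http://example.com/gone-3.html",
        };
        for (String u : urls) {
            HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
            conn.setInstanceFollowRedirects(false); // we want the raw Location header
            System.out.println(u + " -> " + conn.getResponseCode()
                + " " + conn.getHeaderField("Location"));
        }
    }
}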
