On 9/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Tomi NA wrote:
> > On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote:
> >> Just guessing, but could this be caused by session ids in the URL?  Or
> >> some other unimportant piece of data?  If this is the case, then every
> >> page would be added to the index when it's crawled, regardless of
> >> whether it's already in there, with a different session id.  If this is
> >> what's causing your problem, then you need to use the regexp URL
> >> normaliser to strip out the session ids.
> >
> > Nice try but no luck, I'm afraid.
> > The complete web is absolutely static. The reason is that we've set up
> > IIS (I'm not too happy choosing IIS over apache) to serve files from a
> > shared directory on the same server, the rationale being that we'd
> > rather have http://-type links than file://.
> > From what I've seen in the logs, I don't see URLs varying so I'm still
> > at square one. Still, thanks for the effort. If you have any other
> > ideas, I'm eager to hear them.
>
> The best way to discover what's going on is to start from a small subset
> of injected urls, and do the following:
>
> * inject
>
> * dump the db to a text file
>
> * generate / fetch / updatedb
>
> * dump the db again to a second text file
>
> * compare the files.
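
For the last step, comparing the two dumps, here is a minimal Python sketch. It assumes each record in the text dump begins with a line whose first whitespace-separated token is the URL; that layout is an assumption about the dump format, not something confirmed in this thread, so adjust the parsing to match what the dump actually contains. The function names are hypothetical helpers, not part of Nutch.

```python
def crawldb_urls(dump_text):
    """Collect URLs from a text dump of the db.

    Assumption: a record starts with a line whose first token is the URL.
    Lines that do not start with http:// or https:// are ignored.
    """
    urls = set()
    for line in dump_text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith(("http://", "https://")):
            urls.add(parts[0])
    return urls


def compare_dumps(before_text, after_text):
    """Return (added, removed): URLs only in the second dump, and only in the first."""
    before = crawldb_urls(before_text)
    after = crawldb_urls(after_text)
    return sorted(after - before), sorted(before - after)
```

Reading the two dump files into strings and passing them to compare_dumps() lists the URLs that appeared or disappeared between the two runs. If the same page shows up under two slightly different URLs, that is the place to look.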

I'll see if I'm able to reproduce those steps here, thanks.

t.n.a.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general