On 9/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Tomi NA wrote:
> > On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote:
> >> Just guessing, but could this be caused by session ids in the URL? Or
> >> some other unimportant piece of data? If this is the case, then every
> >> page would be added to the index when it's crawled, regardless of
> >> whether it's already in there, with a different session id. If this is
> >> what's causing your problem, then you need to use the regexp URL
> >> normaliser to strip out the session ids.
> >
> > Nice try but no luck, I'm afraid.
> > The complete web is absolutely static. The reason is that we've set up
> > IIS (I'm not too happy choosing IIS over Apache) to serve files from a
> > shared directory on the same server, the rationale being that we'd
> > rather have http://-type links than file://.
> > From what I've seen in the logs, I don't see URLs varying so I'm still
> > at square one. Still, thanks for the effort. If you have any other
> > ideas, I'm eager to hear them.
>
> The best way to discover what's going on is to start from a small subset
> of injected urls, and do the following:
>
> * inject
> * dump the db to a text file
> * generate / fetch / updatedb
> * dump the db again to a second text file
> * compare the files.
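
For reference, the last step (comparing the two dumps) could be sketched in Python roughly as follows. This is only a sketch under an assumption: that each record in the text dump begins with its URL at the start of a line, which should be checked against the actual dump files. The names `urls_in_dump` and `compare_dumps` are made up for illustration and are not part of Nutch.

```python
import re

# Assumption: a record in the text dump starts with its URL at the
# beginning of a line. Adjust the pattern if the dump looks different.
URL_AT_LINE_START = re.compile(r"^(https?://\S+)")


def urls_in_dump(text):
    """Collect every URL that starts a line in a text dump."""
    urls = set()
    for line in text.splitlines():
        match = URL_AT_LINE_START.match(line)
        if match:
            urls.add(match.group(1))
    return urls


def compare_dumps(before_text, after_text):
    """Report which URLs were added, removed, or kept between two dumps."""
    before = urls_in_dump(before_text)
    after = urls_in_dump(after_text)
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
        "kept": sorted(before & after),
    }
```

Reading both dump files into strings and passing them to `compare_dumps` gives the URLs that only show up after the generate / fetch / updatedb round. If those differ from existing entries only by a session id or a similar parameter, that would point back to the URL normaliser suggestion.
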
I'll see if I'm able to reproduce those steps here, thanks.

t.n.a.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
