On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote:
> Just guessing, but could this be caused by session ids in the URL? Or
> some other unimportant piece of data? If this is the case, then every
> page would be added to the index when it's crawled, regardless of
> whether it's already in there, with a different session id. If this is
> what's causing your problem, then you need to use the regexp URL
> normaliser to strip out the session ids.
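
For anyone who does hit the session-id case described above, here is a rough sketch of the idea in Python. It is not the Nutch regexp URL normaliser or its configuration, only an illustration of what stripping session ids before indexing amounts to. The function name `strip_session_id` and the parameter names in `SESSION_PARAMS` are assumptions chosen for the example; a real site may use different ones.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names (hypothetical; adjust for the actual site).
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid", "sessionid"}

# Matches a ";jsessionid=..." path parameter, as appended by some servlet containers.
PATH_SESSION = re.compile(r";jsessionid=[^/?#]*", re.IGNORECASE)


def strip_session_id(url):
    """Return the URL with session-id path and query parameters removed."""
    parts = urlsplit(url)
    # Drop a session id embedded in the path.
    path = PATH_SESSION.sub("", parts.path)
    # Keep every query parameter except the assumed session-id ones.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment)
    )
```

With something like this applied before a URL is looked up in the index, two fetches of the same page under different session ids collapse to a single key, so the page is not added twice.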

Nice try, but no luck, I'm afraid. The whole site is completely static. The reason is that we've set up IIS (I'm not too happy about choosing IIS over Apache) to serve files from a shared directory on the same server, the rationale being that we'd rather have http://-type links than file://.

From what I've seen in the logs, the URLs don't vary, so I'm still at square one. Still, thanks for the effort. If you have any other ideas, I'm eager to hear them.

t.n.a.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
