Tomi NA wrote:
> On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote:
>> Just guessing, but could this be caused by session ids in the URL? Or
>> some other unimportant piece of data? If this is the case, then every
>> page would be added to the index when it's crawled, regardless of
>> whether it's already in there, with a different session id. If this is
>> what's causing your problem, then you need to use the regexp URL
>> normaliser to strip out the session ids.
>
> Nice try but no luck, I'm afraid.
> The complete web is absolutely static. The reason is that we've set up
> IIS (I'm not too happy choosing IIS over apache) to serve files from a
> shared directory on the same server, the rationale being that we'd
> rather have http://-type links than file://.
> From what I've seen in the logs, I don't see URLs varying so I'm still
> at square one. Still, thanks for the effort. If you have any other
> ideas, I'm eager to hear them.

The best way to discover what's going on is to start from a small subset
of injected urls, and do the following:

* inject
* dump the db to a text file
* generate / fetch / updatedb
* dump the db again to a second text file
* compare the files.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
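
The last step in the list above, comparing the two dumps, can be sketched in Python as follows. This is only an illustration, not part of Nutch: it assumes each record in the text dump starts with a line whose first tab-separated field is the URL, and the names `parse_dump` and `compare_dumps` are hypothetical. Adjust the pattern if your dump looks different.

```python
import re

# Assumption: a record begins with a line of the form "<url>\t<rest>".
# Any following non-blank lines belong to that record.
RECORD_START = re.compile(r"^(https?://\S+)\t")


def parse_dump(text):
    """Map each URL to the list of lines that make up its record."""
    records = {}
    current = None
    for line in text.splitlines():
        match = RECORD_START.match(line)
        if match:
            current = match.group(1)
            records[current] = [line]
        elif current is not None and line.strip():
            records[current].append(line)
    return records


def compare_dumps(before_text, after_text):
    """Return sorted lists of URLs added, removed, and changed."""
    before = parse_dump(before_text)
    after = parse_dump(after_text)
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(
        url for url in set(before) & set(after) if before[url] != after[url]
    )
    return added, removed, changed
```

Read the two dump files into strings and pass them to `compare_dumps`. With a static site, URLs showing up as "added" that differ only slightly from ones already present would be the first thing to look at.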
