Tomi NA wrote:
On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote:
Just guessing, but could this be caused by session ids in the URL?  Or
some other unimportant piece of data?  If this is the case, then every
page would be added to the index when it's crawled, regardless of
whether it's already in there, with a different session id.  If this is
what's causing your problem, then you need to use the regexp URL
normaliser to strip out the session ids.
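
To illustrate the effect of stripping session ids, here is a minimal Python sketch. It is not Nutch's regexp normaliser configuration, only a stand-alone illustration of the same idea. The parameter names (jsessionid, phpsessid, sid) and the function name strip_session_id are assumptions; substitute whatever the crawled site actually appends.

```python
import re

# Hypothetical session-id markers; adjust to what the site really uses.
_PATH_SESSION = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)
_QUERY_SESSION = re.compile(r"(?<=[?&])(?:phpsessid|sid)=[^&#]*&?", re.IGNORECASE)
# A "?" or "&" left with nothing after it once a parameter is removed.
_DANGLING = re.compile(r"[?&](?=#|$)")


def strip_session_id(url):
    """Return url with the assumed session-id parameters removed."""
    url = _PATH_SESSION.sub("", url)
    url = _QUERY_SESSION.sub("", url)
    return _DANGLING.sub("", url)
```

With a rule like this, two URLs that differ only in their session id reduce to the same string, so the page would be stored once rather than once per crawl.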

Nice try but no luck, I'm afraid.
The entire site is completely static. The reason is that we've set up
IIS (I'm not too happy choosing IIS over Apache) to serve files from a
shared directory on the same server, the rationale being that we'd
rather have http://-type links than file://.
From what I've seen in the logs, I don't see URLs varying so I'm still
at square one. Still, thanks for the effort. If you have any other
ideas, I'm eager to hear them.

The best way to discover what's going on is to start from a small subset of injected URLs and do the following:

* inject

* dump the db to a text file

* generate / fetch / updatedb

* dump the db again to a second text file

* compare the files.
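
The final comparison step could be sketched in Python roughly as follows. This assumes the two dumps are plain text files and that each record begins with its URL at the start of a line; the actual dump layout may differ, and the function names here are made up for illustration.

```python
import difflib


def compare_dumps(before_lines, after_lines):
    """Return unified-diff lines between two text dumps of the db."""
    return list(difflib.unified_diff(
        before_lines, after_lines,
        fromfile="before", tofile="after", lineterm=""))


def urls_in(lines):
    # Assumption: a record starts with its URL at the beginning of a line.
    return {line.split()[0] for line in lines
            if line.startswith(("http://", "https://"))}


def new_urls(before_lines, after_lines):
    """URLs present in the second dump but not in the first."""
    return urls_in(after_lines) - urls_in(before_lines)


def read_lines(path):
    # Read a dump file into a list of lines without trailing newlines.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()
```

Calling new_urls(read_lines(first_dump), read_lines(second_dump)) on the two files should show which entries the generate / fetch / updatedb cycle added, and compare_dumps shows every changed line. If the same page turns up under more than one URL, that points at the cause of the duplicates.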

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

