Tomi NA wrote:
> On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote:
>> Just guessing, but could this be caused by session ids in the URL?  Or
>> some other unimportant piece of data?  If this is the case, then every
>> page would be added to the index when it's crawled, regardless of
>> whether it's already in there, with a different session id.  If this is
>> what's causing your problem, then you need to use the regexp URL
>> normaliser to strip out the session ids.
>
> Nice try but no luck, I'm afraid.
> The whole site is completely static. The reason is that we've set up
> IIS (I'm not too happy choosing IIS over Apache) to serve files from a
> shared directory on the same server, the rationale being that we'd
> rather have http://-type links than file://.
> From what I've seen in the logs, I don't see URLs varying so I'm still
> at square one. Still, thanks for the effort. If you have any other
> ideas, I'm eager to hear them.

The best way to discover what's going on is to start from a small subset
of injected URLs, and do the following:

* inject

* dump the db to a text file

* generate / fetch / updatedb

* dump the db again to a second text file

* compare the files.
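
For the last step, any diff tool will do. As a minimal sketch, assuming the
two dumps are plain text files, the comparison could look like this in
Python. The function names and the file names in the usage note are
hypothetical, not part of Nutch, and nothing here depends on the exact
format of the dump:

```python
import difflib
from pathlib import Path


def diff_dumps(before_text, after_text, before_name="before", after_name="after"):
    """Return the unified-diff lines between two text dumps.

    An empty list means the two dumps are identical.
    """
    return list(
        difflib.unified_diff(
            before_text.splitlines(),
            after_text.splitlines(),
            fromfile=before_name,
            tofile=after_name,
            lineterm="",
        )
    )


def diff_dump_files(before_path, after_path):
    """Read two dump files (hypothetical paths) and diff their contents."""
    before = Path(before_path).read_text(encoding="utf-8", errors="replace")
    after = Path(after_path).read_text(encoding="utf-8", errors="replace")
    return diff_dumps(before, after, before_path, after_path)
```

Calling diff_dump_files("dump1.txt", "dump2.txt") and printing each returned
line shows the entries that were added ("+") or removed ("-") between the
two dumps. A URL that shows up as added even though it was already present
in the first dump would be the thing to look at.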

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general