Juho, thanks, that was what I was looking for.

What I still don't understand: when is this URL normalization done? Or more precisely: what will I have to do with my already crawled pages? Reindex? Update the db? A simple dedup did not seem to do the job...

Regards,

Hans Benedict

_________________________________________________________________
Chemie.DE Information Service GmbH     Hans Benedict
Seydelstraße 28                        mailto: [EMAIL PROTECTED]
10117 Berlin, Germany                  Tel +49 30 204568-40
                                      Fax +49 30 204568-70

www.Chemie.DE               |          www.ChemieKarriere.NET
www.Bionity.COM             |          www.BioKarriere.NET


Juho Mäkinen wrote:

Take a look under conf/regex-normalize.xml

I don't know how it works, but it seems to do just what you need:
removing session data from GET URLs. It is configured to
remove PHPSESSID variables by default, but you should easily be
able to figure out how to customize it for your needs.
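
To illustrate the idea only (this is not the rule shipped in conf/regex-normalize.xml, just a minimal Python sketch of the same kind of regex substitution), the following strips a session parameter from a URL's query string. The function name strip_session_ids and the example.com URLs are made up for illustration; PHPSESSID is the parameter mentioned above.

```python
import re

# Assumption: parameter names to strip. PHPSESSID is the one named in
# this thread; add others as your sites require.
SESSION_PARAMS = ("PHPSESSID",)


def strip_session_ids(url, params=SESSION_PARAMS):
    """Return url with the given query parameters removed (hypothetical helper)."""
    for name in params:
        # Drop "name=value" (and a directly following "&"), keeping the
        # "?" or "&" that introduced it.
        url = re.sub(r"([?&])" + re.escape(name) + r"=[^&#]*&?", r"\1", url)
    # Remove a dangling "?" or "&" left at the end of the query string.
    url = re.sub(r"[?&](?=#|$)", "", url)
    return url


if __name__ == "__main__":
    a = "http://example.com/page.php?PHPSESSID=abc123&id=5"
    b = "http://example.com/page.php?id=5&PHPSESSID=zzz999"
    # Both variants collapse to the same normalized URL.
    print(strip_session_ids(a))
    print(strip_session_ids(b))
```

Once two URLs that differ only in their session id normalize to the same string, they can be treated as one page instead of two.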

- Juho Mäkinen, http://www.juhonkoti.net

On 6/27/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
Hi,

I am crawling some sites that use session ids. As the crawler does not
use cookies, they are put in the URL's query string. This results in
thousands of pages that are - based on the visible content - duplicates,
but are not detected as such, because the URLs contained in the HTML are
different.

Has anybody found a solution to this problem? Is there a way to activate
cookies for the crawler?

--
Kind regards,

Hans Benedict


