URL normalization occurs during parsing, so it only affects pages parsed after the rule is in place. If your index isn't that big, it may be easier to start your crawl from scratch.
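
In case it helps to see what such a rule amounts to, here is a rough Python sketch of the idea. This is not Nutch's code and not the contents of conf/regex-normalize.xml; the function name strip_session_ids and the SESSION_PARAMS set are made up for illustration, and the only session parameter assumed is the PHPSESSID one mentioned in this thread.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of session parameters to drop (compared in lowercase).
# PHPSESSID is the only one named in this thread; extend for your sites.
SESSION_PARAMS = {"phpsessid"}


def strip_session_ids(url: str) -> str:
    """Return the URL with session-id query parameters removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the session ones, preserving order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Note: urlencode re-encodes the remaining values, so escaping may be
    # written differently from the original even when nothing was removed.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    a = strip_session_ids("http://example.com/page.php?PHPSESSID=abc123&id=5")
    b = strip_session_ids("http://example.com/page.php?PHPSESSID=zzz999&id=5")
    print(a)
    print(a == b)
```

Two URLs that differ only in their session id come out identical, which is what lets later duplicate detection treat them as the same page. It also shows why pages stored before the rule existed are not helped: their URLs were recorded before any such rewrite ran.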

On 6/29/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
> Juho, thanks, that was what I was looking for.
>
> What I still don't understand: When is this URL normalization done? Or
> more precisely: What will I have to do with my already crawled pages?
> Reindex? Update the db? A simple dedup did not seem to do the job...
>
> Regards,
>
> Hans Benedict
>
> _________________________________________________________________
> Chemie.DE Information Service GmbH    Hans Benedict
> Seydelstraße 28                       mailto: [EMAIL PROTECTED]
> 10117 Berlin, Germany                 Tel +49 30 204568-40
>                                       Fax +49 30 204568-70
>
> www.Chemie.DE   | www.ChemieKarriere.NET
> www.Bionity.COM | www.BioKarriere.NET
>
> Juho Mäkinen wrote:
>
> >Take a look under conf/regex-normalize.xml
> >
> >I don't know how it works, but it seems to do just what you need,
> >removing session data from GET urls. It's been configured to
> >remove PHPSESSID variables by default, but you should be
> >easily able to figure out how to customize it for your needs.
> >
> > - Juho Mäkinen, http://www.juhonkoti.net
> >
> >On 6/27/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
> >
> >>Hi,
> >>
> >>I am crawling some sites that use session ids. As the crawler does not
> >>use cookies, they are put in the url's querystring. This results in
> >>thousands of pages that are - based on the visible content - duplicates,
> >>but are not detected as such, because the urls contained in the html are
> >>different.
> >>
> >>Has anybody found a solution to this problem? Is there a way to activate
> >>cookies for the crawler?
> >>
> >>--
> >>Kind regards,
> >>
> >>Hans Benedict
