URL normalization occurs during parsing.  If your index isn't that
big, it may be easier to start your crawl from scratch.
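
For illustration only, here is a rough Python sketch (not Nutch code) of
what a session-stripping rule like the PHPSESSID one Juho mentions below
does to a URL. The pattern and the name strip_session_id are made up for
this example and are not copied from conf/regex-normalize.xml. It also
shows why a dedup on stored URLs may not help: two URLs that differ only
in their session id only become the same string after normalization.

```python
import re

# Hypothetical pattern written for this sketch; it is NOT the rule
# shipped in Nutch's conf/regex-normalize.xml.
# Matches "PHPSESSID=<value>" (plus a trailing "&", if any) when it
# directly follows "?" or "&".
_SESSION_PARAM = re.compile(r"(?<=[?&])PHPSESSID=[^&#]*&?", re.IGNORECASE)

# Matches a "?" or "&" left dangling at the end of the query string.
_DANGLING_SEPARATOR = re.compile(r"[?&](?=#|$)")


def strip_session_id(url: str) -> str:
    """Return url with any PHPSESSID query parameter removed."""
    url = _SESSION_PARAM.sub("", url)
    return _DANGLING_SEPARATOR.sub("", url)


if __name__ == "__main__":
    crawled = [
        "http://example.com/page.php?PHPSESSID=aaa111&id=7",
        "http://example.com/page.php?id=7&PHPSESSID=bbb222",
    ]
    # As raw strings these are two distinct URLs; once normalized
    # they collapse to a single entry.
    normalized = {strip_session_id(u) for u in crawled}
    print(normalized)
```

Stripping the parameter before the URL is stored is what keeps the
duplicates out in the first place, which is why re-crawling with the
rule active may be simpler than cleaning up afterwards.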

On 6/29/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
> Juho, thanks, that was what I was looking for.
> 
> What I still don't understand: When is this URL-Normalization done? Or
> more precisely: What will I have to do with my already crawled pages?
> Reindex? Update the db? A simple dedup did not seem to do the job...
> 
> Regards,
> 
> Hans Benedict
> 
> _________________________________________________________________
> Chemie.DE Information Service GmbH     Hans Benedict
> Seydelstraße 28                        mailto: [EMAIL PROTECTED]
> 10117 Berlin, Germany                  Tel +49 30 204568-40
>                                        Fax +49 30 204568-70
> 
> www.Chemie.DE               |          www.ChemieKarriere.NET
> www.Bionity.COM             |          www.BioKarriere.NET
> 
> 
> 
> Juho Mäkinen wrote:
> 
> >Take a look under conf/regex-normalize.xml
> >
> >I don't know how it works, but it seems to do just what you need,
> >removing session data from GET URLs. It's been configured to
> >remove PHPSESSID variables by default, but you should easily be
> >able to figure out how to customize it for your needs.
> >
> > - Juho Mäkinen, http://www.juhonkoti.net
> >
> >On 6/27/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
> >
> >
> >>Hi,
> >>
> >>I am crawling some sites that use session ids. As the crawler does not
> >>use cookies, they are put in the URL's query string. This results in
> >>thousands of pages that are - based on the visible content - duplicates,
> >>but are not detected as such, because the URLs contained in the HTML are
> >>different.
> >>
> >>Has anybody found a solution to this problem? Is there a way to activate
> >>cookies for the crawler?
> >>
> >>--
> >>Kind regards,
> >>
> >>Hans Benedict
> >>
>
