Re: dedup vs. session ids

Hans Wed, 29 Jun 2005 08:53:24 -0700

Andy Liu wrote:

URL normalization occurs during parsing.  If your index isn't that
big, it may be easier to start your crawl from scratch.


Can I redo the parsing without re-fetching? Or is only the parsed data stored on
disk?

Can I re-fetch only some servers while keeping the data of the other servers intact? (It's only a handful of my servers that use session ids.)

Will the old pages with badly normalized URLs get overwritten by the new ones
or will I have to delete them manually?

Thanks for your help!

Regards,

Hans Benedict




Andy Liu wrote:

URL normalization occurs during parsing.  If your index isn't that
big, it may be easier to start your crawl from scratch.

On 6/29/05, Hans Benedict <[EMAIL PROTECTED]> wrote:

Juho, thanks, that was what I was looking for.

What I still don't understand: When is this URL normalization done? Or
more precisely: What will I have to do with my already crawled pages?
Reindex? Update the db? A simple dedup did not seem to do the job...

Regards,

Hans Benedict

_________________________________________________________________
Chemie.DE Information Service GmbH     Hans Benedict
Seydelstraße 28                        mailto: [EMAIL PROTECTED]
10117 Berlin, Germany                  Tel +49 30 204568-40
                                      Fax +49 30 204568-70

www.Chemie.DE               |          www.ChemieKarriere.NET
www.Bionity.COM             |          www.BioKarriere.NET



Juho Mäkinen wrote:

Take a look under conf/regex-normalize.xml

I don't know how it works, but it seems to do just what you need:
removing session data from GET URLs. It's configured to
remove PHPSESSID variables by default, but you should easily be
able to figure out how to customize it for your needs.
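
For illustration only, the kind of rewrite such a regex rule performs can be
sketched in Python. This is not Nutch's code and not the format of
conf/regex-normalize.xml; the pattern, the strip_session_id helper and the
example URL are assumptions made up for this sketch.

```python
import re

# Hypothetical pattern: a PHPSESSID=<value> query parameter, plus one
# trailing '&' separator if there is one.
_SESSION_PARAM = re.compile(r"(?<=[?&])PHPSESSID=[^&#]*&?", re.IGNORECASE)


def strip_session_id(url: str) -> str:
    """Return url with any PHPSESSID query parameter removed."""
    cleaned = _SESSION_PARAM.sub("", url)
    # Drop a '?' or '&' left dangling at the end of the query string.
    return re.sub(r"[?&](?=#|$)", "", cleaned)


if __name__ == "__main__":
    print(strip_session_id("http://example.com/page.php?id=5&PHPSESSID=abc123"))
```

Two URLs that differ only in their session id then map to the same string,
so they can be treated as the same page.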

- Juho Mäkinen, http://www.juhonkoti.net

On 6/27/05, Hans Benedict <[EMAIL PROTECTED]> wrote:

Hi,

I am crawling some sites that use session ids. As the crawler does not
use cookies, they are put in the URL's query string. This results in
thousands of pages that are - based on the visible content - duplicates,
but are not detected as such, because the URLs contained in the HTML are
different.
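
A small Python sketch of why this happens, assuming duplicates are found by
hashing the raw page content (the MD5 choice, the PHPSESSID parameter name,
the content_digest helper and both sample pages are assumptions for
illustration, not taken from the crawler):

```python
import hashlib
import re

# Hypothetical session parameter as it would appear inside links in the HTML.
_SESSION_IN_LINK = re.compile(r"(?<=[?&])PHPSESSID=[A-Za-z0-9]*&?")


def content_digest(html: str, ignore_session_ids: bool = False) -> str:
    """Hash the page body, optionally with session ids stripped from links."""
    if ignore_session_ids:
        html = _SESSION_IN_LINK.sub("", html)
    return hashlib.md5(html.encode("utf-8")).hexdigest()


# Same visible content, but each fetch got its own session id in the links.
page_a = '<p>Same text</p><a href="/next.php?PHPSESSID=aaa111">next</a>'
page_b = '<p>Same text</p><a href="/next.php?PHPSESSID=bbb222">next</a>'

if __name__ == "__main__":
    print(content_digest(page_a) == content_digest(page_b))
    print(content_digest(page_a, True) == content_digest(page_b, True))
```

The raw digests differ although a reader sees identical pages; once the
session ids are ignored, the two digests agree.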

Has anybody found a solution to this problem? Is there a way to activate
cookies for the crawler?

--
Kind regards,

Hans Benedict



--
Hans
