Andy Liu wrote:
URL normalization occurs during parsing. If your index isn't that
big, it may be easier to start your crawl from scratch.
Can I do parsing without re-fetching? Or are only the parsed data stored on
disk?
Can I re-fetch only some servers while keeping the data of the other servers intact? (It's only a handful of my servers that use session ids.)
Will the old pages with badly normalized URLs get overwritten by the new ones,
or will I have to delete them manually?
Thanks for your help!
Regards,
Hans Benedict
On 6/29/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
Juho, thanks, that was what I was looking for.
What I still don't understand: When is this URL-Normalization done? Or
more precisely: What will I have to do with my already crawled pages?
Reindex? Update the db? A simple dedup did not seem to do the job...
Regards,
Hans Benedict
_________________________________________________________________
Chemie.DE Information Service GmbH Hans Benedict
Seydelstraße 28 mailto: [EMAIL PROTECTED]
10117 Berlin, Germany Tel +49 30 204568-40
Fax +49 30 204568-70
www.Chemie.DE | www.ChemieKarriere.NET
www.Bionity.COM | www.BioKarriere.NET
Juho Mäkinen wrote:
Take a look at conf/regex-normalize.xml
I don't know how it works, but it seems to do just what you need:
removing session data from GET URLs. It's configured to
remove PHPSESSID variables by default, but you should easily
be able to figure out how to customize it for your needs.
- Juho Mäkinen, http://www.juhonkoti.net
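For readers unfamiliar with regex-based URL normalization, here is a minimal Python sketch of the idea: a substitution rule that drops a PHPSESSID parameter from the query string. The pattern and the function name `strip_session_id` are assumptions made for illustration; the rule actually shipped in conf/regex-normalize.xml may be written differently.

```python
import re

# Hypothetical pattern: matches a PHPSESSID=... query parameter together with
# the separator in front of it. Not copied from conf/regex-normalize.xml.
SESSION_PARAM = re.compile(r"([?&])PHPSESSID=[^&#]*&?", re.IGNORECASE)

# Removes a '?' or '&' left dangling at the end of the query string.
DANGLING_SEPARATOR = re.compile(r"[?&](?=#|$)")


def strip_session_id(url: str) -> str:
    """Return the URL with any PHPSESSID query parameter removed."""
    cleaned = SESSION_PARAM.sub(r"\1", url)
    return DANGLING_SEPARATOR.sub("", cleaned)


if __name__ == "__main__":
    for u in (
        "http://example.com/page.php?PHPSESSID=abc123&id=5",
        "http://example.com/page.php?id=5&PHPSESSID=abc123",
        "http://example.com/page.php?PHPSESSID=abc123",
    ):
        print(strip_session_id(u))
```

Other session parameters (a site-specific `sid`, for example) would need their own pattern along the same lines.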
On 6/27/05, Hans Benedict <[EMAIL PROTECTED]> wrote:
Hi,
I am crawling some sites that use session IDs. As the crawler does not
use cookies, they are put in the URL's query string. This results in
thousands of pages that are - based on the visible content - duplicates,
but are not detected as such, because the URLs contained in the HTML are
different.
Has anybody found a solution to this problem? Is there a way to activate
cookies for the crawler?
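
To illustrate the problem described above, here is a small Python sketch. It fingerprints two pages whose visible content is identical but whose embedded links carry different session IDs. The sample pages, the PHPSESSID parameter name, and the function `content_hash` are made up for illustration; this is not how the crawler itself computes duplicates.

```python
import hashlib
import re

# Hypothetical pattern for a session ID embedded in a link inside the HTML.
SESSION_IN_LINK = re.compile(r"([?&])PHPSESSID=[^&#\"'\s>]*&?", re.IGNORECASE)


def content_hash(html: str, strip_sessions: bool = False) -> str:
    """Fingerprint a page, optionally after removing session IDs from links."""
    if strip_sessions:
        html = SESSION_IN_LINK.sub(r"\1", html)
    return hashlib.md5(html.encode("utf-8")).hexdigest()


# Two made-up pages: same visible text, different session ID in the link.
page_a = '<p>Hello</p><a href="/next.php?PHPSESSID=aaa111">next</a>'
page_b = '<p>Hello</p><a href="/next.php?PHPSESSID=bbb222">next</a>'

if __name__ == "__main__":
    # The raw fingerprints differ, so the pages do not look like duplicates.
    print(content_hash(page_a) == content_hash(page_b))
    # With session IDs removed, the fingerprints match.
    print(content_hash(page_a, True) == content_hash(page_b, True))
```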
--
Kind regards,
Hans Benedict