[EMAIL PROTECTED] wrote:
Hi,

I'm using Nutch to index a single site. I need to crawl/fetch/index the staging version of the site and then use the resulting index for searching the production site. The problem is that the staging and production sites have different URLs, for example:

Staging:    http://STAGING.example.com/foo/bar.html
Production: http://WWW.example.com/foo/bar.html

What I'd like to be able to do is index the staging site, then just push the index to production and have it work for production searches. Obviously, the links stored in the index would be wrong (STAGING.example.com vs. WWW.example.com). What is the best way to accomplish this?

One thing I was thinking of was to index the staging site, then open up the CrawlDb and LinkDb (any others?), loop through them, and write out new versions of those files, changing the keys (URLs) along the way, for instance from http://STAGING.example.com/foo/bar.html to http://WWW.example.com/foo/bar.html.

Has anyone done this? Does it sound realistic/doable? Is there a better/faster/easier way, e.g. changing URLs immediately at fetch/parse/index time, or changing URLs on the fly at search time when displaying results?
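For what it's worth, here is a rough sketch of what that rewrite pass could look like, assuming the Nutch 1.x on-disk layout where each CrawlDb part (crawldb/current/part-NNNNN) is a Hadoop MapFile of Text keys to CrawlDatum values; the class name and paths are illustrative, not anything shipped with Nutch. Note that because every key here shares the same host prefix, swapping the prefix preserves the sorted key order that MapFile.Writer requires:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/**
 * Sketch: copy one CrawlDb part file, rewriting the host in every key.
 * Assumes Nutch 1.x / Hadoop 1.x APIs (MapFile of Text -> CrawlDatum).
 */
public class CrawlDbHostRewriter {

  public static void main(String[] args) throws IOException {
    String in = args[0];   // e.g. crawldb/current/part-00000
    String out = args[1];  // e.g. crawldb-www/current/part-00000

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    MapFile.Reader reader = new MapFile.Reader(fs, in, conf);
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, out, Text.class, CrawlDatum.class);
    try {
      Text key = new Text();
      CrawlDatum value = new CrawlDatum();
      while (reader.next(key, value)) {
        // Swap the staging host prefix for the production one.
        String url = key.toString().replaceFirst(
            "^http://staging\\.example\\.com",
            "http://www.example.com");
        writer.append(new Text(url), value);
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

One caveat with this approach: the LinkDb would need the same treatment, and there the keys alone aren't enough, since its Inlinks values carry URLs (each Inlink has a fromUrl) that would also have to be rewritten.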
There is another option - when fetching, configure Nutch to use a URL-rewriting proxy, which will rewrite your requests for www.example.com to staging.example.com on the fly, get the response, and return the content. The only thing left to do then is to rewrite the absolute outlinks contained in the content from staging to www - but this can be done in URLNormalizers.
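A minimal sketch of such a normalizer, assuming the Nutch URLNormalizer plugin interface with its (url, scope) signature; the class name and host names here are illustrative:

import java.net.MalformedURLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLNormalizer;

/**
 * Sketch of a URLNormalizer that maps the staging host back to the
 * production host, so outlinks extracted from fetched pages are
 * stored under the www URLs. It would be activated like any other
 * urlnormalizer plugin via plugin.includes.
 */
public class StagingToProductionNormalizer implements URLNormalizer {

  private Configuration conf;

  public String normalize(String urlString, String scope)
      throws MalformedURLException {
    // Swap the staging host prefix for the production one;
    // leave all other URLs untouched.
    return urlString.replaceFirst(
        "^http://staging\\.example\\.com",
        "http://www.example.com");
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

If you'd rather not write a plugin at all, the stock urlnormalizer-regex plugin should be able to express the same substitution as a pattern/substitution rule in conf/regex-normalize.xml.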
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
