[EMAIL PROTECTED] wrote:
Hi,

I'm using Nutch to index a single site. I need to crawl/fetch/index the staging version of the site and then use the resulting index for searching the production site. The problem is that the staging and production sites have different URLs, for example:

Staging:    http://STAGING.example.com/foo/bar.html
Production: http://WWW.example.com/foo/bar.html

What I'd like to be able to do is index the staging site, then just push the index to production and have it work for production searches. Obviously, the links stored in the index would be wrong (STAGING.example.com vs. WWW.example.com). What is the best way to accomplish this?

One thing I was thinking of was to index the staging site, then open up the CrawlDb and LinkDb (any others?), loop through them, and write out new versions of those files, changing the keys (URLs) along the way, for instance from http://STAGING.example.com/foo/bar.html to http://WWW.example.com/foo/bar.html.

Has anyone done this? Does it sound realistic/doable? Is there a better/faster/easier way, e.g. changing URLs immediately at fetch/parse/index time, or changing URLs on the fly at search time when displaying results?
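For what it's worth, here is a rough sketch of what that rewrite pass could look like, assuming the Nutch 1.x on-disk layout where each CrawlDb part (crawldb/current/part-NNNNN) is a Hadoop MapFile of Text keys to CrawlDatum values; the class name and paths are illustrative, not anything shipped with Nutch. Note that because every key here shares the same host prefix, swapping the prefix preserves the sorted key order that MapFile.Writer requires:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/**
 * Sketch: copy one CrawlDb part file, rewriting the host in every key.
 * Assumes Nutch 1.x / Hadoop 1.x APIs (MapFile of Text -> CrawlDatum).
 */
public class CrawlDbHostRewriter {

  public static void main(String[] args) throws IOException {
    String in = args[0];   // e.g. crawldb/current/part-00000
    String out = args[1];  // e.g. crawldb-www/current/part-00000

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    MapFile.Reader reader = new MapFile.Reader(fs, in, conf);
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, out, Text.class, CrawlDatum.class);
    try {
      Text key = new Text();
      CrawlDatum value = new CrawlDatum();
      while (reader.next(key, value)) {
        // Swap the staging host prefix for the production one.
        String url = key.toString().replaceFirst(
            "^http://staging\\.example\\.com",
            "http://www.example.com");
        writer.append(new Text(url), value);
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

One caveat with this approach: the LinkDb would need the same treatment, and there the keys alone aren't enough, since its Inlinks values carry URLs (each Inlink has a fromUrl) that would also have to be rewritten.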
There is another option - when fetching, configure Nutch to use a URL-rewriting proxy, which will rewrite your requests for www.example.com to staging.example.com on the fly, get the response, and return the content. The only thing left to do then is to rewrite the absolute outlinks contained in the content from staging to www - but this can be done in URLNormalizers.
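A minimal sketch of such a normalizer, assuming the Nutch URLNormalizer plugin interface with its (url, scope) signature; the class name and host names here are illustrative:

import java.net.MalformedURLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLNormalizer;

/**
 * Sketch of a URLNormalizer that maps the staging host back to the
 * production host, so outlinks extracted from fetched pages are
 * stored under the www URLs. It would be activated like any other
 * urlnormalizer plugin via plugin.includes.
 */
public class StagingToProductionNormalizer implements URLNormalizer {

  private Configuration conf;

  public String normalize(String urlString, String scope)
      throws MalformedURLException {
    // Swap the staging host prefix for the production one;
    // leave all other URLs untouched.
    return urlString.replaceFirst(
        "^http://staging\\.example\\.com",
        "http://www.example.com");
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

If you'd rather not write a plugin at all, the stock urlnormalizer-regex plugin should be able to express the same substitution as a pattern/substitution rule in conf/regex-normalize.xml.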
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
