Hi,
I'm using Nutch to index a single site. I need to crawl/fetch/index the
staging version of the site and then use the resulting index to search the
production site. The problem is that the staging and production sites have
different URLs, for example:
Staging:
http://STAGING.example.com/foo/bar.html
Production:
http://WWW.example.com/foo/bar.html
What I'd like to be able to do is index the staging site and then just push
the index to production and have it work for production searches. Obviously,
the URLs stored in the index would point at the wrong host
(STAGING.example.com vs. WWW.example.com).
What is the best way to accomplish this?
One thing I was thinking was to index the staging site, then open up CrawlDb
and LinkDb (any others?), loop through them and write out a new version of
those files, changing the keys (URLs) along the way, for instance from
http://STAGING.example.com/foo/bar.html to http://WWW.example.com/foo/bar.html
Has anyone done this? Does this sound realistic/doable?
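To make the idea concrete, here is a rough sketch in plain Java of the
key-rewriting step only. It does not touch Nutch or Hadoop at all: an in-memory
map stands in for the CrawlDb/LinkDb entries, and the names (KeyRewriteSketch,
rewrite, rewriteKeys) are made up for illustration. The output goes into a
TreeMap because, as far as I understand, these databases are stored as sorted
map files, so the keys would need to be re-sorted once the host changes.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: rewrites URL keys from one prefix to another.
final class KeyRewriteSketch {

    // Replace a leading fromPrefix (compared case-insensitively) with toPrefix.
    // URLs that do not start with fromPrefix are returned unchanged.
    static String rewrite(String url, String fromPrefix, String toPrefix) {
        if (url.regionMatches(true, 0, fromPrefix, 0, fromPrefix.length())) {
            return toPrefix + url.substring(fromPrefix.length());
        }
        return url;
    }

    // Copy every entry into a new sorted map under its rewritten key.
    // The map is only a stand-in for the real on-disk records.
    static <V> SortedMap<String, V> rewriteKeys(Map<String, V> db,
                                                String fromPrefix,
                                                String toPrefix) {
        SortedMap<String, V> out = new TreeMap<>();
        for (Map.Entry<String, V> e : db.entrySet()) {
            out.put(rewrite(e.getKey(), fromPrefix, toPrefix), e.getValue());
        }
        return out;
    }
}
```

The prefixes would be "http://STAGING.example.com/" and
"http://WWW.example.com/", with the trailing slash so that a longer hostname
that merely starts with the staging name is not rewritten by accident.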
Is there a better/faster/easier way?
e.g. changing URLs immediately at fetch/parse/index time?
e.g. changing URLs on the fly at search time when displaying results?
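For the search-time option, something like the following is what I have in
mind: leave the index alone and swap the host just before a hit's URL is
displayed. This is only a sketch using java.net.URI. The class name, the
method name and the two host constants are my own assumptions, and where it
would hook into the search front end is exactly the part I don't know.

```java
import java.net.URI;

// Hypothetical display-time helper, not part of Nutch.
final class DisplayUrlSketch {
    static final String STAGING_HOST = "STAGING.example.com";
    static final String PRODUCTION_HOST = "WWW.example.com";

    // Return the URL to show to the user: same scheme, port, path, query and
    // fragment as the stored URL, but with the staging host replaced.
    static String toProduction(String storedUrl) {
        URI u;
        try {
            u = URI.create(storedUrl);
        } catch (IllegalArgumentException e) {
            return storedUrl; // not parseable, show it as stored
        }
        String host = u.getHost();
        if (host == null || !host.equalsIgnoreCase(STAGING_HOST)) {
            return storedUrl; // some other host, leave untouched
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://");
        if (u.getRawUserInfo() != null) sb.append(u.getRawUserInfo()).append('@');
        sb.append(PRODUCTION_HOST);
        if (u.getPort() != -1) sb.append(':').append(u.getPort());
        // Raw components are used so existing percent-escapes are kept as-is.
        if (u.getRawPath() != null) sb.append(u.getRawPath());
        if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
        if (u.getRawFragment() != null) sb.append('#').append(u.getRawFragment());
        return sb.toString();
    }
}
```

The upside would be that nothing in the index has to be rewritten. The
downside is that anything that reads the stored URLs directly would still see
the staging host.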
Thanks,
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share