All: I'm looking for a sanity check before I embark on something.
The web servers I want to index are visible to the public on port 80 via some load-balancing equipment, but they don't themselves listen on port 80. That is, from the public perspective they look like http://foo.bar.com/, but internally they're actually http://foo.bar.com:9000/.

I need my Nutch server to crawl via the internal port, not the public port 80. When people get search results, however, I want the URLs in the results to reference port 80 only (foo.bar.com) and not expose the internal port. I can crawl on the alternate port just fine, but the port then shows up in the search results.

It looks like BasicIndexingFilter is where the URL is added as a field in the index. I'm thinking that I need to create my own variant that removes the unwanted port info before adding the URL, but I'm afraid that doing so may cause problems when it's time to reindex things.

An alternative might be to modify the URL just before displaying my search results. I'd prefer to serve the results without having to change them, though, since it seems like that would scale better.

Any suggestions? Did I miss something?

Thanks.
-- jeff
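
P.S. To make the question concrete, here is a rough sketch of the rewrite I have in mind. It is plain Java and deliberately not tied to any Nutch interface, since the same helper could be called either from a custom indexing filter or at display time. The class name PortStripper, the method name stripInternalPort, and the hard-coded port 9000 are placeholders of mine, not anything that exists in Nutch.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: maps an internal crawl URL to its public form.
final class PortStripper {
    // Assumption: the internal port from the example above.
    static final int INTERNAL_PORT = 9000;

    // Returns the URL without the internal port, or the input unchanged
    // if it cannot be parsed or does not use that port.
    static String stripInternalPort(String url) {
        URI u;
        try {
            u = new URI(url);
        } catch (URISyntaxException e) {
            return url; // leave anything unparseable alone
        }
        if (u.getScheme() == null || u.getHost() == null
                || u.getPort() != INTERNAL_PORT) {
            return url;
        }
        // Rebuild from the raw (still percent-encoded) parts so that
        // escapes in the path or query are not altered.
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://");
        if (u.getRawUserInfo() != null) {
            sb.append(u.getRawUserInfo()).append('@');
        }
        sb.append(u.getHost());
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return sb.toString();
    }
}
```

With this, PortStripper.stripInternalPort("http://foo.bar.com:9000/docs?id=1") should give back http://foo.bar.com/docs?id=1, while URLs on any other port pass through untouched. What I can't tell is whether changing the indexed URL this way would confuse Nutch when it later tries to match index entries back to the crawled pages.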
