On Tuesday, April 16, 2002, at 03:00 PM, Soriana Villanueva wrote:
> I am trying to index a site that has a couple of pages being redirected
> and am trying to preserve the address that htdig sees and not the one the
> redirectservlet returns.
> When one clicks on the above address and the page loads, the address
> reverts to http://mydomain.com/index.jsp?pageid=4006.

This seems like rather strange site design, IMHO (i.e. I cannot bookmark any pages?) but I digress for the moment.

> I was able to dump out an url list with the url_list attribute and
> everything seems fine.

No, there's a distinction here. The URL list includes *all* URLs that are "seen" when indexing. This includes invalid URLs, broken pages and, in your case, URLs that are redirected.

I'm not exactly clear on how your site is working--it sounds like the servlet is sending a redirect to the browser, but somehow a different page comes up WITH THE SAME URL.

The key here is that htdig is doing exactly what it's obligated to do by web standards: when it receives a redirect, it changes the URL. Nothing you can do with url_rewrite_rules is going to change this behavior. You could certainly modify the code in htdig/Retriever.cc (got_redirect), but IMHO, this isn't necessarily a great idea.

Personally, I'd question why the servlet is working this way and whether it can't skip the redirect, especially when the User-agent: is htdig.

<http://www.htdig.org/attrs.html#user_agent>

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
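
To illustrate the last suggestion (having the servlet skip the redirect when the User-agent: is htdig), here is a rough sketch of the decision logic. It is written in Python only for brevity and is not the poster's servlet code. The function name choose_response and the sample User-agent strings are made up for the example, and it assumes the crawler's User-agent string contains "htdig", which is what the user_agent attribute would normally control.

```python
def choose_response(user_agent, target_url):
    """Decide how a hypothetical redirecting handler should answer a request.

    Returns a (status_code, headers) pair. This is an illustrative sketch,
    not the actual servlet logic from the site being indexed.
    """
    # Assumption: the indexer identifies itself with "htdig" somewhere in
    # its User-agent string.
    if "htdig" in user_agent.lower():
        # No redirect: serve the content under the requested URL, so the
        # indexer records the address it originally asked for.
        return 200, {}
    # Everyone else gets the usual redirect to the target address.
    return 302, {"Location": target_url}


if __name__ == "__main__":
    target = "http://mydomain.com/index.jsp?pageid=4006"
    print(choose_response("htdig/3.1.6 (admin@mydomain.com)", target))
    print(choose_response("Mozilla/4.0 (compatible)", target))
```

With something along these lines, htdig never sees a redirect for those pages and so has no reason to change the URL it stores, while browsers keep the current behavior.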

