On Tuesday, April 16, 2002, at 03:00  PM, Soriana Villanueva wrote:

> I am trying to index a site that has a couple of pages being redirected
> and am trying to preserve the address that htdig sees and not the one the
> redirectservlet returns.

> When one clicks on the above address and the page loads, the address
> reverts to http://mydomain.com/index.jsp?pageid=4006.

This seems like a rather strange site design, IMHO (i.e., I cannot bookmark 
any pages?), but I digress for the moment.

> I was able to dump out an url list with the url_list attribute and 
> everything seems fine.

No, there's a distinction here. The URL list includes *all* URLs that 
are "seen" when indexing. This includes invalid URLs, broken pages and, 
in your case, URLs that are redirected.

I'm not exactly clear on how your site is working. It sounds like the 
servlet is sending a redirect to the browser, but somehow a different 
page comes up WITH THE SAME URL. The key here is that htdig is doing 
exactly what web standards oblige it to do: when it receives a 
redirect, it changes the URL.
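
As a rough illustration of that behavior (not htdig's actual code, which 
is C++ in htdig/Retriever.cc), a redirect-following indexer can be 
sketched in Python as below. The function name, the set of status codes 
and the example request URL are assumptions made for the sketch.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch (an assumption,
# not a list taken from htdig).
REDIRECT_CODES = {301, 302, 303, 307}


def url_to_index(requested_url, status, headers):
    """Return the URL a redirect-following indexer would record.

    requested_url: the URL that was fetched
    status:        the HTTP status code of the response
    headers:       dict of response headers
    """
    if status in REDIRECT_CODES and "Location" in headers:
        # Location may be relative, so resolve it against the
        # URL that was requested.
        return urljoin(requested_url, headers["Location"])
    # No redirect: the requested URL is the one that gets indexed.
    return requested_url


# Hypothetical request URL; only the target comes from the question.
recorded = url_to_index(
    "http://mydomain.com/redirect?id=1",
    302,
    {"Location": "/index.jsp?pageid=4006"},
)
print(recorded)
```

The point is simply that once the server answers with a redirect, the 
original address is no longer the one that ends up in the index.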

Nothing you can do with url_rewrite_rules is going to change this 
behavior. You could certainly modify the code in htdig/Retriever.cc 
(got_redirect), but IMHO, this isn't necessarily a great idea. 
Personally, I'd question why the servlet is working this way and whether 
it can't skip the redirect (the Location: header), especially when the 
User-agent: is htdig.

<http://www.htdig.org/attrs.html#user_agent>
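
The server-side check could look something like the sketch below. It is 
written in Python only to show the logic; the real change would go in 
the servlet itself. The function name is hypothetical, and the "htdig" 
token assumes the user_agent attribute has been left at a value 
containing that string.

```python
def wants_redirect(user_agent, crawler_token="htdig"):
    """Decide whether to send the redirect for this request.

    Returns False when the User-agent header contains the crawler
    token (so the page can be served directly under the requested
    URL), and True for everything else.
    """
    # A missing header is treated as an ordinary client.
    return crawler_token.lower() not in (user_agent or "").lower()


# Hypothetical User-agent strings, for illustration only.
for agent in ("Mozilla/4.0 (compatible)", "htdig/3.1.6 (admin@example.com)"):
    action = "redirect" if wants_redirect(agent) else "serve page directly"
    print(agent, "->", action)
```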

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
