According to Jake Baillie:
> I have an evil application that's inserting a session ID (yadda, yadda,
> we've heard it all before).
>
> So, I put together a rewrite rule:
>
> url_rewrite_rules: (.*)\\?BV_SessionID=(.*)\\&(.*) \\1?\\3
>
> Now, I'm actually using htdig not as a search engine here, but merely a
> spider. I'm using the -t option to output a text list of URLs, and I'm
> going to take that list and do something else with it.
>
> What I want to happen is:
>
> http://www.domain.com/something.jsp?BV_SessionID=24324234234&other=yadda&parameter=stupid
>
> to rewrite to:
>
> http://www.domain.com/something.jsp?other=yadda&parameter=stupid
>
> when it enters the database (and writes that db.log text file). This is
> happening, as it stands, with my rule above. When I do htdig -vvv, I can
> see the normalization being done. Good.
>
> The problem - it seems to be taking the links off of the page it retrieves
> (reading into the anchor tags), and normalizing them too, instead of just
> following them verbatim from the page and translating them later. This is a
> problem, because the site cannot be traversed without the session id on the
> line (I know, I didn't design it), but I need it to go away when the page
> is included in the database, because I might have to stop and restart htdig
> before the site is fully traversed, and the session ids expire after 60
> minutes. And htdig doesn't know a page is duplicated if the session id is
> different.
>
> See the problem? :) If not, I can clarify. If so, suggestions are
> appreciated. :)
>
> Please hit reply all, as I'm not subscribed to the list.

OK, this application is a bit more evil than the other session-ID-inserting
applications we've heard all about before. With most of these, the session
ID can be safely omitted before the URL is fetched. Unfortunately, htdig
processes url_rewrite_rules before fetching the URL - it almost has to, as
it needs to know whether this is a new URL before fetching it. What you're
asking for is for htdig to apply url_rewrite_rules only to determine whether
the URL has already been visited, while keeping the session ID in the URL
it actually fetches.

Even that won't be good enough, though. If I understand correctly, the
session ID MUST be there in the URL or you can't access the document, plus,
if the session ID has expired you can no longer access the document until
you get a fresh session ID. So, how can you possibly get htsearch to return
URLs with a usable session ID, so that the search results actually lead to
something you can fetch?

In your position, I think I'd find the programmer of this evil application
and slap him about the head until he agrees to write something that's more
search-engine-friendly.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
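
For anyone who only needs a list of URLs and wants to see the requested behaviour spelled out, here is a rough Python sketch of a spider that follows links verbatim but uses the session-free URL as the "already visited" key. This is not how htdig works and none of it is htdig code: `dedupe_key`, `crawl`, and the caller-supplied `fetch_links` are hypothetical names, and the pattern is only modelled on the quoted rule (it uses `[^&]*` for the session ID value, an assumption on my part, so that only that one parameter is removed). It also does nothing about the expired-session problem described above.

```python
import re
from collections import deque

# Hypothetical pattern modelled on the quoted rewrite rule. Like that rule,
# it only matches when BV_SessionID is followed by at least one more parameter.
SESSION_RE = re.compile(r"(.*)\?BV_SessionID=[^&]*&(.*)")


def dedupe_key(url):
    """Return the URL with the BV_SessionID parameter removed."""
    return SESSION_RE.sub(r"\1?\2", url)


def crawl(start_url, fetch_links):
    """Breadth-first crawl that fetches URLs verbatim.

    fetch_links(url) is supplied by the caller and must return the absolute
    URLs linked from that page. The result maps each session-free key to the
    verbatim URL that was actually fetched for it.
    """
    seen = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        key = dedupe_key(url)
        if key in seen:
            # Same page reached again, possibly under another session ID.
            continue
        seen[key] = url
        # The session ID is still in `url` here, so the site can be traversed.
        for link in fetch_links(url):
            queue.append(link)
    return seen
```

The keys of the returned dictionary are the normalized URLs one would write to a list, while the values are the session-bearing URLs that were actually requested.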

