Re: [htdig] rewriting question: when does it happen?

Gilles Detillieux Wed, 02 Oct 2002 13:26:24 -0700

According to Jake Baillie:
> I have an evil application that's inserting a session ID (yadda, yadda, 
> we've heard it all before).
> 
> So, I put together a rewrite rule:
> 
> url_rewrite_rules: (.*)\\?BV_SessionID=(.*)\\&(.*) \\1?\\3
> 
> Now, I'm actually using htdig not as a search engine here, but merely a 
> spider. I'm using the -t option to output a text list of URLs, and I'm 
> going to take that list and do something else with it.
> 
> What I want to happen is:
> 
> 
> http://www.domain.com/something.jsp?BV_SessionID=24324234234&other=yadda&parameter=stupid
> 
> to rewrite to:
> 
> http://www.domain.com/something.jsp?other=yadda&parameter=stupid
> 
> when it enters the database (and writes that db.log text file). This is 
> happening, as it stands, with my rule above. When I do htdig -vvv, I can 
> see the normalization being done. Good.
> 
> The problem - it seems to be taking the links off of the page it retrieves 
> (reading into the anchor tags), and normalizing them too, instead of just 
> following them verbatim from the page and translating them later. This is a 
> problem, because the site cannot be traversed without the session id on the 
> line (I know, I didn't design it), but I need it to go away when the page 
> is included in the database, because I might have to stop and restart htdig 
> before the site is fully traversed, and the session ids expire after 60 
> minutes. And htdig doesn't know a page is duplicated if the session id is 
> different.
> 
> See the problem? :) If not, I can clarify. If so, suggestions are 
> appreciated. :)
> 
> Please hit reply all, as I'm not subscribed to the list.


OK, this application is a bit more evil than the other session-ID-inserting
applications we've heard all about before.  With most of these, the session
ID can be safely omitted before the URL is fetched.  Unfortunately, htdig
processes url_rewrite_rules before fetching the URL - it really almost has
to, as it needs to know if this is a new URL or not before fetching it.
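
To make that ordering concrete, here is a rough Python sketch (not htdig
code) of a rewrite being applied to a URL before it is looked up or fetched.
The function name strip_session_id and the pattern are assumptions for
illustration only. The pattern uses [^&]* for the session ID rather than the
(.*) in the quoted rule, because in Python's re module a greedy (.*) in that
position would also swallow every parameter up to the last "&". I have not
checked how htdig's own regex library treats that case.

```python
import re

# Hypothetical pattern: the BV_SessionID parameter, plus the "&" that follows it.
# The lookbehind makes sure we only match a whole parameter name.
_SESSION_RE = re.compile(r"(?<=[?&])BV_SessionID=[^&]*&?")


def strip_session_id(url: str) -> str:
    """Return url with the BV_SessionID query parameter removed."""
    cleaned = _SESSION_RE.sub("", url)
    # Drop a dangling "?" or "&" left behind when the session ID was the
    # last (or only) parameter.
    return cleaned.rstrip("?&")


if __name__ == "__main__":
    print(strip_session_id(
        "http://www.domain.com/something.jsp"
        "?BV_SessionID=24324234234&other=yadda&parameter=stupid"
    ))
```

Whatever comes out of a function like this is the only form of the URL the
spider keeps when the rewrite runs first, so the session ID is gone by the
time the page is requested.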

What you're asking for is for htdig to process url_rewrite_rules only
for the purpose of determining if the URL has been visited or not, but
that it keeps the session ID for when it fetches the URL.  Even that
won't be good enough, though.

If I understand correctly, the session ID MUST be there in the URL or you
can't access the document, plus, if the session ID has expired you also
can no longer access the document until you get a fresh session ID.  So,
how can you possibly get htsearch to return URLs with a useable session
ID so that the search results actually lead to something you can fetch?
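
For reference, the behaviour being asked for (rewrite only to decide whether
a URL has been seen, but fetch the link verbatim) could look something like
the following Python sketch. This is not how htdig works; crawl, fetch_links
and dedup_key are hypothetical names, and the caller supplies both the
fetching function and the rewrite. It also does nothing about expired
session IDs, which is the part that remains unsolved.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          dedup_key: Callable[[str], str]) -> List[str]:
    """Breadth-first crawl: fetch URLs verbatim, deduplicate on dedup_key(url).

    fetch_links(url) returns the links found on that page (placeholder).
    dedup_key(url) returns the rewritten form, e.g. without a session ID.
    Returns the rewritten URLs in the order the pages were first visited.
    """
    seen = {dedup_key(start_url)}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(dedup_key(url))      # record the rewritten form
        for link in fetch_links(url):       # request uses the original URL
            key = dedup_key(link)
            if key not in seen:             # compare rewritten forms only
                seen.add(key)
                queue.append(link)          # but queue the verbatim link
    return visited
```

The design point is simply that two values are tracked per link: the
rewritten form goes into the "seen" set and the output list, while the
original form is what gets requested.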

In your position, I think I'd find the programmer of this evil application
and slap him about the head until he agrees to write something that's more
search-engine-friendly.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

