According to Lachlan Andrew:
> In hts_templates.html, it explains the difference between $(VAR)
> ["normal"], $%(VAR) ["escaped for use in a URL"] and $&(VAR)
> ["HTML-escaped"].
>
> a) Why do the hyperlinks in short.html and long.html not use
>    $%(URL)?  From memory, spaces get coded correctly.  I assume it
>    is done explicitly in the code, since "URL" always codes a URL...
>    Should we put comments to that effect in {short,long}.html, in
>    case people copy them for use as their own templates?

htdig doesn't hex-decode the URLs that it reads from the HTML pages, so
the assumption is they were already properly encoded and shouldn't be
re-encoded by htsearch.  Things could break if htdig/htsearch started
second-guessing the encoding of URLs in pages it indexes and doubly
encoded them.  E.g.: a "+" in a query string might be used properly to
encode a space, and if htsearch then hex-encoded it, it would no longer
encode that space but rather a literal "+".

> b) Does the fact that EXCERPT is not HTML-escaped pose a potential
>    security risk?  Punctuation is stripped from EXCERPT, if < and >
>    are "extra word characters", this could cause problems.

This is related to what I was trying to explain to Neal last week, but
somehow we just didn't seem to be understanding each other and he seemed
to completely miss my point.  htsearch already does partially SGML-encode
this variable, piece by piece.  The issue with this string is that it
contains not only stuff from the original web page, which htdig has
already SGML-decoded, but it also contains some HTML tags that htsearch
inserts, e.g. to highlight found words and to link to anchors, so you
can't SGML-encode the whole string.

What htsearch used to do was SGML-encode the string internally, before
inserting the necessary HTML tags.  This worked OK, but when searching
this string for matching words, it missed the matching words if they
contained SGML entities.  What I did was to leave the whole string
unencoded for the search, but when htsearch builds up the EXCERPT
variable from this string and the HTML tags it adds, it now SGML-encodes
the bits and pieces of the string that it takes, one at a time, and
inserts these between the HTML tags it adds.  The end result is the same
as before, but words are highlighted even if they end up having
SGML-encoded parts to them.

By the way, punctuation is not stripped from EXCERPT -- only the original
HTML tags from the source page are.

-- 
Gilles R.
Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
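
P.S. For anyone who finds the two points above easier to follow in code,
here is a rough Python sketch.  It is not htsearch's actual C++ code: the
function name highlight_excerpt, the <strong> highlight tags and the
sample query value "two+words" are all assumptions made up for
illustration.  The first part shows why re-encoding an already-encoded
"+" changes its meaning; the second shows the "search the unencoded
string, then encode the pieces one at a time" approach described for
EXCERPT.

```python
import html
import re
from urllib.parse import quote, unquote_plus

# Part (a): a "+" that already encodes a space must not be encoded again.
original = "two+words"                  # hypothetical, already-encoded query value
re_encoded = quote(original, safe="")   # what a second encoding pass would produce

decoded_once = unquote_plus(original)       # "two words": "+" means a space
decoded_twice = unquote_plus(re_encoded)    # "two+words": now a literal "+"


# Part (b): match against the unencoded text, escape piece by piece.
def highlight_excerpt(text, word, start_tag="<strong>", end_tag="</strong>"):
    """Return an HTML excerpt with every occurrence of `word` highlighted.

    `text` is the decoded page text. Matching happens on the raw text, so
    words containing characters such as "&" are still found. Each piece is
    escaped separately and placed between the inserted tags, so the tags
    themselves are never escaped. (Illustrative helper, not htsearch code.)
    """
    if not word:
        return html.escape(text)
    parts = []
    pos = 0
    for match in re.finditer(re.escape(word), text, flags=re.IGNORECASE):
        parts.append(html.escape(text[pos:match.start()]))   # text before the hit
        parts.append(start_tag + html.escape(match.group(0)) + end_tag)
        pos = match.end()
    parts.append(html.escape(text[pos:]))                    # trailing text
    return "".join(parts)


if __name__ == "__main__":
    print(re_encoded)      # two%2Bwords
    print(decoded_once)    # two words
    print(decoded_twice)   # two+words
    print(highlight_excerpt("AT&T <b> rocks", "at&t"))
    # <strong>AT&amp;T</strong> &lt;b&gt; rocks
```

Note that escaping the whole excerpt after the tags were inserted would
also escape the <strong> tags, and escaping it before the search would
turn "AT&T" into "AT&amp;T" so that a search for "at&t" no longer
matched, which is the problem described above.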