According to Lachlan Andrew:
> In hts_templates.html, it explains the difference between $(VAR)
> ["normal"], $%(VAR) ["escaped for use in a URL"] and $&(VAR)
> ["HTML-escaped"].
> 
> a) Why do the hyperlinks in short.html and long.html not use
>    $%(URL)?  From memory, spaces get coded correctly.  I assume it
>    is done explicitly in the code, since "URL" always codes a URL...
>    Should we put comments to that effect in {short,long}.html, in
>    case people copy them for use as their own templates?

htdig doesn't hex-decode the URLs that it reads from HTML pages, so the
assumption is that they were already properly encoded and shouldn't be
re-encoded by htsearch.  Things could break if htdig or htsearch started
second-guessing the encoding of URLs in the pages it indexes and encoded
them a second time.  For example, a "+" in a query string might be used
properly to encode a space; if htsearch then hex-encoded it as "%2B", it
would no longer encode that space but rather a literal "+".
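
To make the "+" case concrete, here is a minimal Python sketch.  It is not
htdig code; the URL and the helper query_value() are made up for the
example, and Python's urllib stands in for whatever encoding routine
htsearch would use.

```python
from urllib.parse import quote, unquote_plus

# Hypothetical URL as it might appear in an indexed page.
# The "+" already encodes a space in the query string.
original = "http://www.example.com/search?q=spinal+cord"

# Encoding the already-encoded URL again turns "+" into "%2B".
reencoded = quote(original, safe=":/?=&")


def query_value(url):
    """Decode the value after "q=" the way a form handler would."""
    return unquote_plus(url.split("q=", 1)[1])


print(reencoded)
print(query_value(original))   # the "+" decodes to a space
print(query_value(reencoded))  # the "%2B" decodes to a literal "+"
```

The receiving script would see "spinal cord" for the original link but
"spinal+cord" for the re-encoded one, so the link would no longer mean
what the page author intended.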

> b) Does the fact that EXCERPT is not HTML-escaped pose a potential
>    security risk?  Punctuation is stripped from EXCERPT, but if < and >
>    are "extra word characters", this could cause problems.

This is related to what I was trying to explain to Neal last week, though
we didn't seem to understand each other and I think he missed my point.
htsearch already does SGML-encode this variable, but it does so piece by
piece rather than encoding the whole string at once.

The issue with this string is that it contains not only text from the
original web page, which htdig has already SGML-decoded, but also some
HTML tags that htsearch inserts, e.g. to highlight found words and to
link to anchors, so you can't SGML-encode the whole string.  What
htsearch used to do was SGML-encode the string internally, before
inserting the necessary HTML tags.  This worked OK, but when searching
this string for matching words, it missed any matching words containing
characters that had been turned into SGML entities.

What I did was to leave the whole string unencoded for the search, but
when htsearch builds up the EXCERPT variable from this string and the HTML
tags it adds, it now SGML-encodes the bits and pieces of the string that
it takes, one at a time, and inserts these between the HTML tags it adds.
The end result is the same as before, but words are highlighted even if
they end up having SGML-encoded parts to them.
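
As a rough illustration of the piece-by-piece approach, here is a short
Python sketch.  It is not the htsearch code: build_excerpt() is a
hypothetical helper, the <strong> tag is just an assumed highlight tag,
and html.escape() stands in for htsearch's SGML encoding.

```python
import html
import re


def build_excerpt(text, word):
    """Escape each piece of the raw text separately and wrap matches in <strong>."""
    pieces = []
    pos = 0
    # Search the unencoded text, so words containing "&", "<", etc. still match.
    for m in re.finditer(re.escape(word), text, flags=re.IGNORECASE):
        # Encode the text before the match, then the match itself inside the tags.
        pieces.append(html.escape(text[pos:m.start()]))
        pieces.append("<strong>" + html.escape(m.group(0)) + "</strong>")
        pos = m.end()
    # Encode whatever follows the last match.
    pieces.append(html.escape(text[pos:]))
    return "".join(pieces)


raw = "Research at AT&T labs <1990>"
print(build_excerpt(raw, "at&t"))

# The old order of operations: encode first, then search.  The word is missed.
print("at&t" in html.escape(raw).lower())
```

Only the inserted tags reach the output unescaped; everything taken from
the page is encoded, and the match on "AT&T" is still found because the
search runs on the raw text.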

By the way, punctuation is not stripped from EXCERPT -- only the original
HTML tags from the source page are.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev