Matt Sullivan
Tue, 24 Sep 2002 01:49:19 -0700
Hi Kir,

On Fri, 20 Sep 2002, Kir Kolyshkin wrote:

> Matt, regarding your last changes (addition of TranslateToUri).
> Can you please show us HTML document with URLs formed using
> SGML entities?

This is what I am referring to:

* http://www.htdig.org/htdig-dev/1999/07/0132.html
* http://www.leekillough.com/robots.html - see section titled
  "Catch Sloppy SGML/HTML Parsing"

Currently ASPseek stores URLs and forms request URIs blind. It neither
decodes SGML entities as per the HTML 4.01 spec[1] nor ensures that
URIs are properly character encoded when forming the request URI.

The way I'd like to implement this is:

Pre-URL storage:

SGML entities should certainly be decoded[1]. A generally acceptable
approach would be to:

* decode all Latin1 entities directly;
* recode all entities having ordinal value > 255 to UTF-8 and
  character encode[2] the resulting bytes;
* leave all unrecognised entities unmodified within the URI.

Pre-Request:

The pre-request process would have to ensure that characters which
require character encoding according to the BNF defined in RFC1738[3]
are character encoded prior to executing the request. This would, of
course, include any non-ASCII characters[2] left in the URL after the
pre-storage step.

1. <URL:http://www.w3.org/TR/1999/REC-html401-19991224/types.html#type-cdata>
2. <URL:http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars>
3. <URL:http://www.ietf.org/rfc/rfc1738.txt>

> Also, have you noted that SgmlToChar returns
> WORD, not char, and if there will be, say, &#1234; sequence,
> result will not be fitted to char that you use to hold result.

Yes, admittedly I overlooked this :) The definitions above will allow
that to be easily resolved, however.

Thoughts?

Matt.
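
For illustration only, the two steps proposed above (pre-URL storage and pre-request) could be sketched in C++ roughly as follows. This is not ASPseek code: DecodeUrlEntities, EncodeForRequest and the helper functions are hypothetical names, the named-entity table is a tiny stand-in for a real one, and the set of characters escaped before the request is an assumed subset of what RFC1738 calls unsafe, not a full reading of its BNF. Existing '%' escapes are assumed to be valid and are left alone.

```cpp
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <string>

// Append one byte as a %XX escape (uppercase hex).
static void AppendEscaped(std::string& out, unsigned char byte) {
    static const char kHex[] = "0123456789ABCDEF";
    out += '%';
    out += kHex[byte >> 4];
    out += kHex[byte & 0x0F];
}

// Encode a code point above 255 as UTF-8 and escape each resulting byte.
static void AppendUtf8Escaped(std::string& out, uint32_t cp) {
    if (cp < 0x800) {
        AppendEscaped(out, static_cast<unsigned char>(0xC0 | (cp >> 6)));
    } else if (cp < 0x10000) {
        AppendEscaped(out, static_cast<unsigned char>(0xE0 | (cp >> 12)));
        AppendEscaped(out, static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
    } else {
        AppendEscaped(out, static_cast<unsigned char>(0xF0 | (cp >> 18)));
        AppendEscaped(out, static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
        AppendEscaped(out, static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
    }
    AppendEscaped(out, static_cast<unsigned char>(0x80 | (cp & 0x3F)));
}

// Resolve the text between '&' and ';'. Returns false if unrecognised.
static bool ResolveEntity(const std::string& name, uint32_t& cp) {
    if (name.size() > 1 && name[0] == '#') {
        bool hex = name.size() > 2 && (name[1] == 'x' || name[1] == 'X');
        const char* start = name.c_str() + (hex ? 2 : 1);
        if (!std::isxdigit(static_cast<unsigned char>(*start))) return false;
        char* end = nullptr;
        unsigned long value = std::strtoul(start, &end, hex ? 16 : 10);
        if (*end != '\0' || value == 0 || value > 0x10FFFF) return false;
        cp = static_cast<uint32_t>(value);
        return true;
    }
    // Illustrative table only; a real one would cover all HTML 4.01 entities.
    if (name == "amp")  { cp = '&';  return true; }
    if (name == "lt")   { cp = '<';  return true; }
    if (name == "gt")   { cp = '>';  return true; }
    if (name == "quot") { cp = '"';  return true; }
    return false;
}

// Pre-URL storage: values <= 255 are stored as a single byte, larger values
// become escaped UTF-8, and unrecognised entities are left unmodified.
std::string DecodeUrlEntities(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '&') {
            std::size_t semi = in.find(';', i + 1);
            uint32_t cp = 0;
            if (semi != std::string::npos &&
                ResolveEntity(in.substr(i + 1, semi - i - 1), cp)) {
                if (cp <= 0xFF) out += static_cast<char>(cp);
                else AppendUtf8Escaped(out, cp);
                i = semi;  // skip past the ';'
                continue;
            }
        }
        out += in[i];
    }
    return out;
}

// Pre-request: escape control, non-ASCII and (assumed) unsafe characters.
std::string EncodeForRequest(const std::string& url) {
    static const std::string kUnsafe = " \"<>\\^`{|}";
    std::string out;
    for (unsigned char c : url) {
        if (c <= 0x20 || c >= 0x7F ||
            kUnsafe.find(static_cast<char>(c)) != std::string::npos)
            AppendEscaped(out, c);
        else
            out += static_cast<char>(c);
    }
    return out;
}
```

The decoder holds each entity value in a uint32_t rather than a char, which is one way around the WORD-versus-char problem raised above. A stored URL would pass through DecodeUrlEntities once at indexing time and through EncodeForRequest each time a request is formed.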