aseek-devel  

Re: [aseek-devel] Re: [cvs commit]: User matt, module aspseek

Matt Sullivan
Tue, 24 Sep 2002 01:49:19 -0700

Hi Kir,

On Fri, 20 Sep 2002, Kir Kolyshkin wrote:

> Matt, regarding your last changes (addition of TranslateToUri).
> Can you please show us HTML document with URLs formed using

This is what I am referring to:

  * http://www.htdig.org/htdig-dev/1999/07/0132.html
  * http://www.leekillough.com/robots.html
    - see section titled "Catch Sloppy SGML/HTML Parsing"

Currently ASPseek stores URLs and forms request URIs blindly.  It neither
decodes SGML entities as per the HTML 4.01 spec[1], nor does it ensure that
URIs are properly character encoded when forming the request URI.

The way I'd like to implement this is:

Pre-URL storage:

SGML entities should certainly be decoded[1].  A generally acceptable approach
would be to decode all Latin1 entities directly, recode all entities having
ordinal value > 255 to UTF-8 and character encode[2] the resulting bytes, and
finally leave all unrecognised entities unmodified within the URI.
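
As a rough illustration of the pre-storage step described above, here is a
minimal Python sketch.  It is not ASPseek code: the function name
decode_entities_for_storage is hypothetical, the entity table is Python's
standard HTML 4.01 table rather than ASPseek's own, and "character encode" is
assumed to mean %HH escaping of each UTF-8 byte.

```python
import re
from html.entities import name2codepoint  # HTML 4.01 named entities

# Matches &name; as well as decimal (&#1234;) and hex (&#x4D2;) references.
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities_for_storage(url: str) -> str:
    """Hypothetical helper: decode SGML entities found in a URL attribute."""

    def replace(match):
        body = match.group(1)
        if body[:2] in ("#x", "#X"):
            codepoint = int(body[2:], 16)
        elif body.startswith("#"):
            codepoint = int(body[1:])
        else:
            codepoint = name2codepoint.get(body)  # None if unrecognised

        # Unrecognised or unusable references are left exactly as written.
        if (codepoint is None or codepoint == 0 or codepoint > 0x10FFFF
                or 0xD800 <= codepoint <= 0xDFFF):
            return match.group(0)

        # Latin1 range: decode directly to the character.
        if codepoint <= 0xFF:
            return chr(codepoint)

        # Above 255: recode to UTF-8 and %HH-escape each resulting byte.
        return "".join("%%%02X" % b for b in chr(codepoint).encode("utf-8"))

    return _ENTITY.sub(replace, url)
```

Under these assumptions, "/a?x=1&amp;y=2" is stored as "/a?x=1&y=2",
"/p?c=&#1234;" becomes "/p?c=%D3%92", and "/p?c=&bogus;" is left untouched.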

Pre-Request:

The pre-request process would have to ensure that characters which require
character encoding according to the BNF defined in RFC1738[3] are character
encoded prior to executing the request.  This would, of course, include any
remaining non-ASCII characters[2] produced prior to URL storage.
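
The pre-request step could be sketched along the following lines in Python.
This is only an illustration under stated assumptions, not ASPseek code: the
names RFC1738_KEEP and encode_for_request are hypothetical, non-ASCII
characters are assumed to be sent as UTF-8 (as suggested in [2]), and existing
%HH escapes are assumed to be valid and are passed through unchanged.

```python
from urllib.parse import quote

# Characters left unescaped: RFC 1738 reserved characters, its "safe" and
# "extra" punctuation, and "%" so that existing %HH escapes are not
# escaped a second time (assumption: stored escapes are already valid).
RFC1738_KEEP = ";/?:@=&" + "$-_.+!*'()," + "%"


def encode_for_request(url: str) -> str:
    """Hypothetical helper: %HH-escape whatever may not appear raw in a URL.

    Letters and digits are never escaped by quote().  Note that Python
    (3.7 and later) also leaves "~" alone, although RFC 1738 lists it
    as unsafe.
    Non-ASCII characters are converted to UTF-8 and each byte is escaped.
    """
    return quote(url, safe=RFC1738_KEEP, encoding="utf-8")
```

With these assumptions, a stored URL such as "/a b/café?q=x|y" would be
requested as "/a%20b/caf%C3%A9?q=x%7Cy", while "/done%20already" is sent
unchanged.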

 1. <URL:http://www.w3.org/TR/1999/REC-html401-19991224/types.html#type-cdata>
 2. <URL:http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars>
 3. <URL:http://www.ietf.org/rfc/rfc1738.txt>


> SGML entities? Also, have you noted that SgmlToChar returns
> WORD, not char, and if there will be, say, #1234; sequence,
> result will not be fitted to char that you use to hold result.

Yes, admittedly I overlooked this :)  The definitions above will allow that to
be easily resolved, however.

Thoughts?


Matt.