There is not, and I think this is a major flaw throughout a lot of the Racket net libraries. I have gone to considerable effort to use bytes throughout the Web server to avoid issues like this, but the URL module is one place where it hurts.
I think it should be written to use bytes internally and provide the UTF-8 string versions for compatibility. Unfortunately, I don't have the time to fix it now.

Jay

On Sat, Sep 24, 2011 at 3:56 PM, Rodolfo Carvalho <[email protected]> wrote:
> Hello,
> I'm running a (simple) web scraper on a page written in iso-8859-1
> (declared in the source using a meta tag).
> The page contains links like this:
> "http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4"
> At one point the code calls:
> (combine-url/relative current-url resource)
> where current-url is a Racket URL and resource is the aforementioned string.
> I then get the error:
> bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
> #"Acad\352micos"
>
> This seems to be a problem with uri-decode.
> (uri-decode resource)
> bytes->string/utf-8: string is not a well-formed UTF-8 encoding:
> #"http://www.ufrj.br/editais.php?tp=Acad\352micos&no=Cursos&idtp=4"
>
> I looked at the source code of uri-decode and saw that after decoding the
> percent-encoded string, a call to bytes->string/utf-8 expects the string to
> be UTF-8 encoded... but there's no way to tell uri-decode to use a different
> encoding.
> I copied the relevant portion of code from uri-codec-unit.rkt in
> collects/net, and verified that I can change bytes->string/utf-8
> => bytes->string/latin-1 and get it to work... but that's like cheating :)
> AFAICT Chrome and Firefox handle the
> URL "http://www.ufrj.br/editais.php?tp=Acad%EAmicos&no=Cursos&idtp=4" as
> well as its UTF-8 %-encoded
> equivalent "http://www.ufrj.br/editais.php?tp=Acad%C3%AAmicos&no=Cursos&idtp=4",
> with the difference that the second appears as
> "http://www.ufrj.br/editais.php?tp=Acadêmicos&no=Cursos&idtp=4" (but when
> copied->pasted is still %C3%AA instead of ê).
>
> How could I make uri-decode understand an encoding other than UTF-8?
>
> Thanks,
>
> Rodolfo Carvalho
>
> _________________________________________________
> For list-related administrative tasks:
> http://lists.racket-lang.org/listinfo/users

--
Jay McCarthy <[email protected]>
Assistant Professor / Brigham Young University
http://faculty.cs.byu.edu/~jay

"The glory of God is Intelligence" - D&C 93
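
For illustration only, here is a minimal sketch of the "bytes internally" idea discussed in this thread: percent-decode to a byte string first, and choose the character encoding afterwards. It is not the net/uri-codec implementation. The names hex-digit-value, percent-decode->bytes and decode-bytes are hypothetical helpers made up for this example, and falling back to Latin-1 when the bytes are not valid UTF-8 is an assumption that fits the iso-8859-1 page above, not a general rule.

```racket
#lang racket/base

;; Hypothetical helper: numeric value of an ASCII hex digit byte, or #f.
(define (hex-digit-value b)
  (cond [(<= 48 b 57) (- b 48)]     ; #\0 .. #\9
        [(<= 65 b 70) (- b 55)]     ; #\A .. #\F
        [(<= 97 b 102) (- b 87)]    ; #\a .. #\f
        [else #f]))

;; Hypothetical helper: percent-decode a string into raw bytes, making no
;; assumption about which character encoding those bytes are in.
;; Malformed escapes (e.g. a trailing "%") are copied through unchanged.
(define (percent-decode->bytes str)
  (define in (string->bytes/utf-8 str))
  (define len (bytes-length in))
  (define out (open-output-bytes))
  (let loop ([i 0])
    (when (< i len)
      (let* ([b (bytes-ref in i)]
             [hi (and (= b 37)                 ; 37 is #\%
                      (< (+ i 2) len)
                      (hex-digit-value (bytes-ref in (+ i 1))))]
             [lo (and hi (hex-digit-value (bytes-ref in (+ i 2))))])
        (cond [(and hi lo)
               (write-byte (+ (* 16 hi) lo) out)
               (loop (+ i 3))]
              [else
               (write-byte b out)
               (loop (+ i 1))]))))
  (get-output-bytes out))

;; Hypothetical helper: decode as UTF-8 when the bytes are well-formed UTF-8,
;; otherwise fall back to Latin-1 (an assumption for this particular site).
(define (decode-bytes bs)
  (if (bytes-utf-8-length bs)       ; #f when bs is not valid UTF-8
      (bytes->string/utf-8 bs)
      (bytes->string/latin-1 bs)))
```

With these definitions, (decode-bytes (percent-decode->bytes "Acad%EAmicos")) and (decode-bytes (percent-decode->bytes "Acad%C3%AAmicos")) should both produce "Acadêmicos". A caller that knows the page's declared charset could skip the guess and call bytes->string/latin-1 on the decoded bytes directly.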

