Hi, Ludovic Courtès <[email protected]> writes:
> Hi Timothy, > > Timothy Sample <[email protected]> skribis: > >> A quick reading of RFC 3986 suggests that the host part of a URI can be >> an IP address (version 4 or 6) or a registered name. It gives the >> following rules for registered names: >> >> reg-name = *( unreserved / pct-encoded / sub-delims ) >> unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" >> pct-encoded = "%" HEXDIG HEXDIG >> sub-delims = "!" / "$" / "&" / "'" / "(" / ")" >> / "*" / "+" / "," / ";" / "=" >> >> Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are >> just the ASCII ranges you might expect (except for that “HEXDIG” only >> allows uppercase letters). > > Do you think you could turn that into a patch for Guile? I’d happily > apply it. :-) > > It looks like both [[:alnum:]] & co. and ranges would be > locale-dependent, so my understanding is that we’ll have to list all the > characters explicitly, right? Here’s a patch for Guile that uses explicit lists of characters in the ‘(web uri)’ module instead of character ranges. It includes two tests that are pretty verbose, but seem to do the trick. I have a bit more background on the problem, mostly coming from a Glibc bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>. It turns out that it is well-known upstream, and avoiding character ranges is the recommended approach for know. Some other GNU tools have adopted what is being called the “Rational Range Interpretation” <https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>. AIUI, this means they use the underlying encoding numbers for ranges (I checked the source, but I’m only mostly sure I read it right). It looks like the Glibc folks are unsure how to proceed on this (but are maybe slightly leaning towards the “rational” approach). It’s all a pretty big mess, really. I was hoping there would be some obvious thing that would fix the problem more generally. Short of pulling in the Gnulib regex code or writing something in Scheme, it looks like Guile is stuck where it is now. I’m unsure if the changes are considered “trivial” from a copyright perspective. It’s pretty close, but I think programmers tend to underestimate here. I’ve started the FSF copyright assignment process either way, since is likely not my last Guile patch. :) -- Tim
>From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001 From: Timothy Sample <[email protected]> Date: Sun, 2 Jun 2019 14:41:20 -0400 Subject: [PATCH] Make URI handling locale independent. Fixes <https://bugs.gnu.org/35785>. * module/web/uri.scm (digits, hex-digits, letters): New variables. (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp, userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly list each character instead of using character ranges. * test-suite/tests/web-uri.test: Add corresponding tests. --- module/web/uri.scm | 31 +++++++++++++++++++++---------- test-suite/tests/web-uri.test | 29 ++++++++++++++++++++++++++--- 2 files changed, 47 insertions(+), 13 deletions(-) diff --git a/module/web/uri.scm b/module/web/uri.scm index 4c6fa5051..b4b89b9cc 100644 --- a/module/web/uri.scm +++ b/module/web/uri.scm @@ -1,6 +1,6 @@ ;;;; (web uri) --- URI manipulation tools ;;;; -;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014 Free Software Foundation, Inc. +;;;; Copyright (C) 1997,2001,2002,2010,2011,2012,2013,2014,2019 Free Software Foundation, Inc. ;;;; ;;;; This library is free software; you can redistribute it and/or ;;;; modify it under the terms of the GNU Lesser General Public @@ -175,17 +175,28 @@ for ‘build-uri’ except there is no scheme." ;;; Converters. ;;; +;; Since character ranges in regular expressions may depend on the +;; current locale, we use explicit lists of characters instead. See +;; <https://bugs.gnu.org/35785> for details. +(define digits "0123456789") +(define hex-digits "0123456789ABCDEFabcdef") +(define letters "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz") + ;; See RFC 3986 #3.2.2 for comments on percent-encodings, IDNA (RFC ;; 3490), and non-ASCII host names. ;; (define ipv4-regexp - (make-regexp "^([0-9.]+)$")) + (make-regexp (string-append "^([" digits ".]+)$"))) (define ipv6-regexp - (make-regexp "^([0-9a-fA-F:.]+)$")) + (make-regexp (string-append "^([" hex-digits ":.]+)$"))) (define domain-label-regexp - (make-regexp "^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$")) + (make-regexp + (string-append "^[" letters digits "]" + "([" letters digits "-]*[" letters digits "])?$"))) (define top-label-regexp - (make-regexp "^[a-zA-Z]([a-zA-Z0-9-]*[a-zA-Z0-9])?$")) + (make-regexp + (string-append "^[" letters "]" + "([" letters digits "-]*[" letters digits "])?$"))) (define (valid-host? host) (cond @@ -203,13 +214,13 @@ for ‘build-uri’ except there is no scheme." (regexp-exec top-label-regexp host start))))))) (define userinfo-pat - "[a-zA-Z0-9_.!~*'();:&=+$,-]+") + (string-append "[" letters digits "_.!~*'();:&=+$,-]+")) (define host-pat - "[a-zA-Z0-9.-]+") + (string-append "[" letters digits ".-]+")) (define ipv6-host-pat - "[0-9a-fA-F:.]+") + (string-append "[" hex-digits ":.]+")) (define port-pat - "[0-9]*") + (string-append "[" digits "]*")) (define authority-regexp (make-regexp (format #f "^//((~a)@)?((~a)|(\\[(~a)\\]))(:(~a))?$" @@ -246,7 +257,7 @@ for ‘build-uri’ except there is no scheme." ;;; either. (define scheme-pat - "[a-zA-Z][a-zA-Z0-9+.-]*") + (string-append "[" letters "][" letters digits "+.-]*")) (define authority-pat "[^/?#]*") (define path-pat diff --git a/test-suite/tests/web-uri.test b/test-suite/tests/web-uri.test index 73391898c..ef8e85eba 100644 --- a/test-suite/tests/web-uri.test +++ b/test-suite/tests/web-uri.test @@ -1,6 +1,6 @@ ;;;; web-uri.test --- URI library -*- mode: scheme; coding: utf-8; -*- ;;;; -;;;; Copyright (C) 2010-2012, 2014, 2017 Free Software Foundation, Inc. +;;;; Copyright (C) 2010-2012, 2014, 2017, 2019 Free Software Foundation, Inc. ;;;; ;;;; This library is free software; you can redistribute it and/or ;;;; modify it under the terms of the GNU Lesser General Public @@ -121,7 +121,18 @@ (pass-if-uri-exception "http://foo@" "Expected.*host" - (build-uri 'http #:userinfo "foo"))) + (build-uri 'http #:userinfo "foo")) + + (pass-if-uri-exception "http://illégal.com" + "Expected.*host" + (dynamic-wind + (lambda () #t) + (lambda () + (with-locale "en_US.utf8" + (reload-module (resolve-module '(web uri))) + (build-uri 'http #:host "illégal.com"))) + (lambda () + (reload-module (resolve-module '(web uri))))))) (with-test-prefix "build-uri-reference" (pass-if "//host/etc/foo" @@ -290,7 +301,19 @@ #:port 100 #:path "/" #:query "q" - #:fragment "bar"))) + #:fragment "bar")) + + ;; bug #35785 + (pass-if "http://www.example.com (sv_SE)" + (dynamic-wind + (lambda () #t) + (lambda () + (with-locale "sv_SE.utf8" + (reload-module (resolve-module '(web uri))) + (uri=? (string->uri "http://www.example.com") + #:scheme 'http #:host "www.example.com" #:path ""))) + (lambda () + (reload-module (resolve-module '(web uri))))))) (with-test-prefix "string->uri-reference" (pass-if "/foo" -- 2.21.0
