Greetings,

While debugging[0] an issue with Bobot++ (poor sneek!) aborting after calling scm_regexp_exec on any UTF-8 string, I eventually realized that the string was actually single-byte encoded internally. After following that down the wrong path for a while, I tested `regexp-exec' with a *valid* Latin-1 string, and that too aborted in `fixup_multibyte_match'.
I have attached a patch that I think is correct. Instead of unconditionally calling `fixup_multibyte_match' when wchar_t is available, it first checks whether the Scheme string being matched is actually a multibyte string. This lets applications that have set no string encoding match non-ASCII strings.

If you call `setlocale' with any locale, things sort of work. In the case of "C", non-ASCII characters are escaped upon read, and in the case of "latin1", `mbrlen' will not reject the char code (AFAICT, I'm not an expert in this area). Unfortunately this means I don't see an easy way to write a test for the suite--the bug only happens when the locale is "C" and no port encoding is set. <http://paste.lisp.org/display/120245#5> is what I was going for, and it will show the bug if run by hand.

I'm not entirely certain this is the *correct* solution, but I think it should be--it seems bad to abort() applications that use regexps but haven't set their locale yet!

(My papers for Guile are on file AFAIK FWIW)

[0] http://paste.lisp.org/display/120245
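To make the failure mode concrete, here is a minimal sketch of a byte-offset to character-offset conversion built on `mbrlen'. It is not Guile's code: `byte_to_char_offset' is a hypothetical helper, and it returns -1 where the code path described above would abort. Whether `mbrlen' rejects a given non-ASCII byte depends on the current locale and the C library.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of Guile: translate a byte offset in S
   into a character offset under the current locale.  Returns -1 when
   mbrlen rejects the input instead of aborting.  */
long
byte_to_char_offset (const char *s, size_t byte_off)
{
  mbstate_t state;
  size_t len, byte_idx = 0;
  long char_idx = 0;

  assert (s != NULL);
  len = strlen (s);
  memset (&state, 0, sizeof state);

  while (byte_idx < byte_off)
    {
      /* Let mbrlen look at the rest of the string, including the NUL.  */
      size_t n = mbrlen (s + byte_idx, len - byte_idx + 1, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        return -1;              /* invalid or truncated sequence */
      if (n == 0)
        break;                  /* reached the terminating NUL */
      byte_idx += n;
      char_idx++;
    }
  return char_idx;
}
```

For a pure ASCII string the two offsets coincide, so `byte_to_char_offset ("hello", 3)' yields 3 in any locale. The interesting case is a string holding bytes above 0x7f while the process is still in the "C" locale, where the conversion may fail outright.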
From 61900d7e93780dd9d7d6db02fe3ad07a72a8a45b Mon Sep 17 00:00:00 2001
From: Clinton Ebadi <clin...@unknownlamer.org>
Date: Sat, 5 Mar 2011 23:44:23 -0500
Subject: [PATCH] 2011-03-05  Clinton Ebadi  <clin...@unknownlamer.org>

* libguile/regex-posix.c (scm_regexp_exec): Only fixup byte to
  character offset when the string is actually multibyte encoded.
---
 libguile/regex-posix.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libguile/regex-posix.c b/libguile/regex-posix.c
index 3423099..db76e36 100644
--- a/libguile/regex-posix.c
+++ b/libguile/regex-posix.c
@@ -305,7 +305,7 @@ SCM_DEFINE (scm_regexp_exec, "regexp-exec", 2, 2, 0,
 		       scm_to_int (flags));
 
 #ifdef HAVE_WCHAR_H
-  if (!status)
+  if ((!status) && (scm_to_int (scm_string_bytes_per_char (substr)) > 1))
     fixup_multibyte_match (matches, nmatches, c_str);
 #endif
 
-- 
1.6.6.1
-- 
Jessie: but today i was a nerd
Jessie: i even read slashdot.