On Mon, Aug 11, 2014 at 8:12 PM, James Bowlin <[email protected]> wrote:
> On Mon, Aug 11, 2014 at 08:02 PM, Denys Vlasenko said:
>> Looks like you want to set CONFIG_UNICODE_SUPPORT=y and unset
>> both of these latter options. This way, all busybox applets
>> should always work in Unicode mode.
>
> That set up does not work for sed, although it sometimes works.

I just tried your sed example.

busybox sed code uses regexp routines for s///.
Those routines are part of libc. Therefore, they will be
Unicode-aware only if you are using CONFIG_UNICODE_USING_LOCALE=y.

I just built busybox against glibc with these options:

CONFIG_LOCALE_SUPPORT=y
CONFIG_UNICODE_SUPPORT=y
CONFIG_UNICODE_USING_LOCALE=y
# CONFIG_FEATURE_CHECK_UNICODE_IN_ENV is not set
CONFIG_SUBST_WCHAR=63
CONFIG_LAST_SUPPORTED_WCHAR=4351
# CONFIG_UNICODE_COMBINING_WCHARS is not set
# CONFIG_UNICODE_WIDE_WCHARS is not set
# CONFIG_UNICODE_BIDI_SUPPORT is not set
# CONFIG_UNICODE_NEUTRAL_TABLE is not set
# CONFIG_UNICODE_PRESERVE_BROKEN is not set

And I'm getting this:

$ export LANG=en_US.UTF-8
$ echo ÀÀÀ | ./busybox sed 's/./x/g' | wc -c
4

> The most mysterious thing is that it sometimes works and what
> I need to do to get it to work in /init in an initrd.

You may have found a bug. bbox never runs setlocale() for init.
According to git log, this behavior was there from the very beginning:

commit e5dfced23a904d08afa5dcee190c3c3d845d9f50
Author: Eric Andersen <[email protected]>
Date:   Mon Apr 9 22:48:12 2001 +0000

    Apply Vladimir's latest cleanup patch.
...
...
+#ifdef BB_LOCALE_SUPPORT
+       if(getpid()!=1) /* Do not set locale for `init' */
+               setlocale(LC_ALL, "");
+#endif

This probably should be changed so that init is not special.

As to your other cases, they are interesting too.
For example, you noticed that ${#VAR} handling is buggy.
Even on the above mentioned build, I get this

$ export LANG=en_US.UTF-8
$ ./busybox sh
/home/srcdevel/bbox/fix/busybox.4z $ a=ÀÀÀ; echo ${#a}
6

whereas "standard" shell gives 3. This is clearly a bug
(or at least "incompatibility").

Please report each such bug separately.

> The "wc -m" solution always works (for me) so my problem is solved.
> But there is still a strange problem with sed and unicode that
> Harald was able to reproduce.

Which problem? There are so many mails in this thread...
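
A footnote on the ${#a} result above: the 6 looks like a byte count and the 3
a character count, since "À" is two bytes in UTF-8 (0xC3 0x80). Below is a
minimal sketch that shows both numbers without depending on the locale. The
function names byte_len and utf8_len are made up for this illustration and
are not part of busybox, and the character count assumes the input is valid
UTF-8.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only (not busybox code).

# Count the bytes in a string, independent of locale.
byte_len() {
    # wc -c counts bytes; $(( )) strips padding that some wc versions add.
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

# Count UTF-8 characters by deleting continuation bytes (0x80-0xBF, octal
# 200-277) and counting what is left. Assumes valid UTF-8 input.
utf8_len() {
    echo $(( $(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c) ))
}

s=$(printf '\303\200\303\200\303\200')   # "ÀÀÀ" written as raw UTF-8 bytes
echo "bytes: $(byte_len "$s")"           # prints "bytes: 6"
echo "chars: $(utf8_len "$s")"           # prints "chars: 3"
```

This mirrors the "wc -c" versus "wc -m" difference, except that it does not
rely on setlocale() having been called, so it should behave the same when
run from init.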
