On 02/20/2013 05:59 PM, Ondrej Oprala wrote:
> On 02/20/2013 12:21 AM, Bernhard Voelker wrote:
>> On 02/18/2013 04:30 PM, Ondrej Oprala wrote:
>>> Hi, I've been working on multibyte support for the {un,}expand utilities
>>> lately, my approach being similar to Padraig's from 2010 (
>>> http://lists.gnu.org/archive/html/coreutils/2010-09/msg00029.html ) .
>>> Both tools now read by lines, not bytes, and then iterate over the
>>> characters properly.
>> Thanks - the package maintainer of all distributions will be very happy
>> once multi-byte support is integrated to all upstream coreutils programs.
>>
>> I can't tell too much about your patch yet - I even tried to avoid
>> the umlaut-ö in my own name whenever possible my whole life ;-)
>> So here are only a few general notes.
>>
>> I had a bit problems to compile because expand-core.h needs the types
>> uintmax_t, uint32_t and uint8_t which are not yet defined.
>> Moving the #include down (in src/{expand,expand-core,unexpand}.c)
>> worked around this problem, e.g. expand.c:
>>
>> --- a/src/expand.c
>> +++ b/src/expand.c
>> @@ -46,13 +46,13 @@
>> # include <unistring/localcharset.h>
>> #endif
>>
>> -#include "expand-core.h"
>> -
>> #include "system.h"
>> #include "error.h"
>> #include "fadvise.h"
>> #include "xstrndup.h"
>>
>> +#include "expand-core.h"
>> +
>> /* The official name of this program (e.g., no 'g' prefix). */
>> #define PROGRAM_NAME "expand"
>>
>>
>> In the expand() and unexpand() functions, the variable rawline can
>> be changed from char** to char*. This makes the code better readable.
>> I'm not sure whether we should free the rawline buffer on any input
>> line anyway, because it's filled by getline() and resized as needed;
>> i.e. reusing malloc'ed space is not a bad thing.
>>
>> And instead of strdup() without checking, we should use xstrdup():
>> {
>> - line = (uint8_t *) strdup (*rawline);
>> + line = (uint8_t *) xstrdup (rawline);
>> clen = 1;
>> }
>> Hopefully, we can avoid that strdup in the final version, and just
>> use the buffer we already have.
>>
>> In the tests, please check the exit status of expand/unexpand,
>> and use our own compare function; e.g. tests/expand/mb.sh:
>>
>> -expand < in > out
>> -cmp out exp > /dev/null 2>&1 || fail=1
>> +expand < in > out || fail=1
>> +compare exp out || fail=1
>>
>> It would be good to have much more tests. I have the feeling
>> that the coverage rate of expand/unexpand wasn't too great
>> also in the existing tests - though the unexpand test looks
>> slightly better than the expand test.
>>
>> Have a nice day,
>> Berny
> Thank you both very much for looking at the patch,
> I realize the patch is a bit big and thus hard to review (for which I'm
> sorry), but
> with {un,}expand tools being so tightly coupled together, I can't imagine a
> different approach.
>
> I've made some fixes/additions.
> The patch now correctly handles characters with a display length > 1,
> The columns variable is now incremented by character unit length, not by 1
> implicitly.
> Unexpand now uses the buffer line + offset to output pending characters as
> well,
> which makes it easier to work with blanks such as ideo-space in the pending
> buffer.
> (special case when pending blanks with display length == 2 get into
> the pending_blank[] buffer is covered in tests.)
> Speaking of tests, I've added a few test cases (thank you again for the
> ideo-space
> reproducers ;) ).
> Tests now use the correct compare, header files are ordered properly and
> s/strdup/xstrdup/ in code.
The attached 5 patches fix lots of build issues.
The test passes as does make syntax-check.
I've not looked at the code yet.
thanks,
Pádraig.
>From cf704d8c1f20dbe26803237a16ebe233b9ea59de Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Wed, 6 Mar 2013 23:51:11 +0000
Subject: [PATCH 1/5] expand: fix compile errors and warnings...
...without HAVE_LIBUNISTRING
---
src/expand-core.c | 6 +++---
src/unexpand.c | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/expand-core.c b/src/expand-core.c
index 184bb70..14200e7 100644
--- a/src/expand-core.c
+++ b/src/expand-core.c
@@ -27,14 +27,14 @@
# include <unistring/localcharset.h>
#endif
-#include "expand-core.h"
-
#include "system.h"
#include "error.h"
#include "fadvise.h"
#include "quote.h"
#include "xstrndup.h"
+#include "expand-core.h"
+
extern size_t first_free_tab;
extern size_t n_tabs_allocated;
@@ -181,7 +181,7 @@ next_file (FILE *fp)
return NULL;
}
-extern uint32_t
+extern uint32_t _GL_ATTRIBUTE_PURE
mb_index (uint8_t *line, size_t *clen, ssize_t *dlen, size_t offt)
{
#if HAVE_LIBUNISTRING
diff --git a/src/unexpand.c b/src/unexpand.c
index 3c1d212..37becbf 100644
--- a/src/unexpand.c
+++ b/src/unexpand.c
@@ -48,13 +48,13 @@
# include <unistring/localcharset.h>
#endif
-#include "expand-core.h"
-
#include "system.h"
#include "error.h"
#include "fadvise.h"
#include "xstrndup.h"
+#include "expand-core.h"
+
/* The official name of this program (e.g., no 'g' prefix). */
#define PROGRAM_NAME "unexpand"
--
1.7.7.6
>From 3a452d70838421f563efe5ba09a6cbd499212063 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Thu, 7 Mar 2013 00:15:55 +0000
Subject: [PATCH 2/5] expand: add missing include for uc_width()
* src/expand-core.c: Include uniwidth.h.
---
src/expand-core.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/src/expand-core.c b/src/expand-core.c
index 14200e7..f298dcc 100644
--- a/src/expand-core.c
+++ b/src/expand-core.c
@@ -24,6 +24,7 @@
# include <unistr.h>
# include <uniconv.h>
# include <unistdio.h>
+# include <uniwidth.h>
# include <unistring/localcharset.h>
#endif
--
1.7.7.6
>From 498ee20fe3ab86d9fa4c3754ffd3aef657a12786 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Thu, 7 Mar 2013 01:58:16 +0000
Subject: [PATCH 3/5] build: properly integrate libunstring
Details for this change are at:
http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00003.html
It serves two purposes.
1. To avoid the dynamic linking overhead for printf
2. To ensure u8_uctomb_aux is available to the libcoreutils.a link
* configure.ac: Ensure we use the gnulib function, and not the
one form the system installed libunistring shared lib.
---
configure.ac | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/configure.ac b/configure.ac
index 79bc920..b164c06 100644
--- a/configure.ac
+++ b/configure.ac
@@ -374,6 +374,10 @@ fi
gl_WINSIZE_IN_PTEM
gl_LIBUNISTRING
+# Ensure that u8_uctomb_aux is compiled from gnulib not libunistring
+# so that `printf` doesn't need to link with libunistring,
+# which was seen to incur a 16% startup overhead.
+AM_CONDITIONAL([LIBUNISTRING_COMPILE_UNISTR_U8_UCTOMB], [true])
gl_HEADER_TIOCGWINSZ_IN_TERMIOS_H
--
1.7.7.6
>From feacd6580121f1a2aa839cf61a571a5127f30464 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Thu, 7 Mar 2013 03:08:32 +0000
Subject: [PATCH 4/5] expand: fix syntax-check errors
* src/expand-core.h: Indentation.
---
src/expand-core.c | 14 --------------
src/expand-core.h | 16 +++++++++++++++-
2 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/src/expand-core.c b/src/expand-core.c
index f298dcc..f0d5b73 100644
--- a/src/expand-core.c
+++ b/src/expand-core.c
@@ -36,20 +36,6 @@
#include "expand-core.h"
-extern size_t first_free_tab;
-
-extern size_t n_tabs_allocated;
-
-extern uintmax_t *tab_list;
-
-extern int exit_status;
-
-extern char **file_list;
-
-extern bool have_read_stdin;
-
-extern bool u8_locale;
-
/* Add the comma or blank separated list of tab stops STOPS
to the list of tab stops. */
diff --git a/src/expand-core.h b/src/expand-core.h
index c3fb44c..0314ae2 100644
--- a/src/expand-core.h
+++ b/src/expand-core.h
@@ -15,7 +15,21 @@
along with this program. If not, see <http://www.gnu.org/licenses/>. */
#ifndef EXP_COMMON_H_
-#define EXP_COMMON_H_
+# define EXP_COMMON_H_
+
+extern size_t first_free_tab;
+
+extern size_t n_tabs_allocated;
+
+extern uintmax_t *tab_list;
+
+extern int exit_status;
+
+extern char **file_list;
+
+extern bool have_read_stdin;
+
+extern bool u8_locale;
void
parse_tab_stops (char const *stops);
--
1.7.7.6
>From e12f09aea341dec884f40910bf2076f2bf32ae33 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Thu, 7 Mar 2013 03:29:50 +0000
Subject: [PATCH 5/5] expand: make test characters more explicit
* tests/expand/mb.sh: Use abstract printf representations
of the various multibyte characters under test, rather
than entering directly, which is clearer and avoids
potential editing issues.
---
tests/expand/mb.sh | 36 +++++++++++++++++-------------------
1 files changed, 17 insertions(+), 19 deletions(-)
diff --git a/tests/expand/mb.sh b/tests/expand/mb.sh
index 136f6d1..450faef 100755
--- a/tests/expand/mb.sh
+++ b/tests/expand/mb.sh
@@ -21,17 +21,17 @@ print_ver_ expand
export LC_ALL=en_US.UTF-8
#input containing multibyte characters
-cat > in <<\EOF
+cat <<\EOF > in || framework_failure_
1234567812345678123456781
. . . .
a b c d
. . . .
ä ö ü Ã
. . . .
- äöü . öüä. ä xx
EOF
+env printf ' äöü\t. öüä. \tä xx\n' >> in || framework_failure_
-cat > exp <<\EOF
+cat <<\EOF > exp || framework_failure_
1234567812345678123456781
. . . .
a b c d
@@ -44,24 +44,22 @@ EOF
expand < in > out || fail=1
compare exp out > /dev/null 2>&1 || fail=1
-#test characters with a display width larger than 1
-cat > in <<\EOF
-12345678
-e |ascii(1)
-é |composed(1)
-eÌ |decomposed(1)
-ã |ideo-space(2)
-ï¼ |full-hypen(2)
-EOF
+#test characters with display widths != 1
+env printf '12345678
+e\t|ascii(1)
+\u00E9\t|composed(1)
+e\u0301\t|decomposed(1)
+\u3000\t|ideo-space(2)
+\uFF0D\t|full-hypen(2)
+' > in || framework_failure_
-cat > exp <<\EOF
-12345678
+env printf '12345678
e |ascii(1)
-é |composed(1)
-eÌ |decomposed(1)
-ã |ideo-space(2)
-ï¼ |full-hypen(2)
-EOF
+\u00E9 |composed(1)
+e\u0301 |decomposed(1)
+\u3000 |ideo-space(2)
+\uFF0D |full-hypen(2)
+' > exp || framework_failure_
expand < in > out || fail=1
compare exp out > /dev/null 2>&1 || fail=1
--
1.7.7.6