feature request: error codes for 'rm'
Hi,

I'm quite surprised that 'rm' does not return an error code when the file does not exist. I would like to see at least error code 1 so that I can use it in a shell script; additional error codes might also be nice.

Regards,
Danny Rawlins
Romster
http://crux.nu/Public/DannyRawlins
___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: feature request: error codes for 'rm'
Danny Rawlins wrote:
> Hi, I'm quite surprised that 'rm' does not return an error code when the file does not exist. I would like to see at least error code 1 so that I can use it in a shell script; additional error codes might also be nice.
>
> Regards, Danny Rawlins
> Romster
> http://crux.nu/Public/DannyRawlins

Damn it, sorry: I made a mistake and did not check properly. Sorry for wasting your time.

Regards,
Danny Rawlins
Problem under Linux
Hello.

I installed Ubuntu on my PC in dual boot with Windows, and I have problems with the screen or with the refresh rate. What is odd is that it used to affect only Linux; now it happens under Windows as well, and sometimes it lasts a very long time. I don't know what to do. I installed Debian instead of Ubuntu and it still happens.

What should I do?
Re: horrible utf-8 performace in wc
Pádraig Brady wrote:
> mbstowcs doesn't canonicalize equivalent multibyte sequences, and so therefore functions the same in this regard as our processing of each wide character separately. This could be considered a bug actually, i.e. should -m give the number of wide chars, or the number of multibyte chars? With the attached patch, `wc -m` gives 23 chars for both these lines.

The behaviour of wc -m is specified by POSIX [1] to output the number of characters. And:

  LC_CTYPE
      Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single-byte as opposed to multi-byte characters in arguments and input files) and which characters are defined as white space characters.

The definition of Character in [2] means a multibyte character. IMO it cannot be interpreted to mean a glyph, or a grapheme cluster, or a screen column. Rather, it is the unit that is processed by a call to mbtowc [3] or mbrtowc [4].

As a consequence:
  - The number of characters is the same as the number of wide characters.
  - wc -m must output the number of characters.
  - In a Unicode locale, U+00E9 is one character, and U+0065 U+0301 is two characters,
      * even if they are canonically equivalent (because POSIX does not make reference to this concept), and
      * even if they render the same on the screen (because except for Curses, POSIX does not refer to the rendering of characters).

If you want wc to count characters after canonicalization, then you can invent a new wc command-line option for it. But I would find it more useful to have a filter program that reads from standard input and writes the canonicalized output to standard output; that would be applicable in many more situations.
Bruno

[1] http://www.opengroup.org/susv3/utilities/wc.html
[2] http://www.opengroup.org/susv3/basedefs/xbd_chap03.html
[3] http://www.opengroup.org/susv3/functions/mbtowc.html
[4] http://www.opengroup.org/susv3/functions/mbrtowc.html
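The counting rule described in this message (one character per unit consumed by mbrtowc, with no canonicalization) can be illustrated with a minimal sketch in C. This is not the code of wc; count_mb_chars and count_chars_in_locale are hypothetical helper names introduced only for illustration, and the sketch assumes a hosted system whose C library provides mbrtowc.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in buf[0..len) as the current LC_CTYPE locale
   sees them: one character per unit consumed by mbrtowc.
   Bytes that fail to decode are skipped and not counted.
   (Hypothetical helper, not part of wc or gnulib.)  */
size_t
count_mb_chars (const char *buf, size_t len)
{
  mbstate_t state;
  size_t chars = 0;

  assert (buf != NULL);
  memset (&state, 0, sizeof state);
  while (len > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, buf, len, &state);

      if (n == (size_t) -2)
        break;                  /* incomplete character at end of buffer */
      if (n == (size_t) -1)
        {
          /* Invalid byte: reset the state, skip it, do not count it.  */
          memset (&state, 0, sizeof state);
          buf++;
          len--;
          continue;
        }
      if (n == 0)
        n = 1;                  /* an embedded NUL is one character */
      buf += n;
      len -= n;
      chars++;
    }
  return chars;
}

/* Count the characters of buf under the named locale.
   Returns (size_t) -1 if that locale is not available.
   (Hypothetical helper, for illustration only.)  */
size_t
count_chars_in_locale (const char *locale_name, const char *buf, size_t len)
{
  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return (size_t) -1;
  return count_mb_chars (buf, len);
}
```

Assuming a UTF-8 locale such as en_US.UTF-8 is installed, the two bytes "\xc3\xa9" (U+00E9) count as one character, while the three bytes "e\xcc\x81" (U+0065 U+0301) count as two, even though both render the same.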
Re: horrible utf-8 performace in wc
> Is there a good library for combining-character canonicalization available? That seems like something that would be useful to have in a lot of text-processing tools. Also, for Unicode, something to shuffle between the normalization forms might be helpful for comparisons.

Such functionality is currently available in IBM's ICU, in GNOME's libunicode, in Simon's libidn, and should be available in gnulib in some time. Please contact me if you want to help with the gnulib implementation.

Bruno
Re: horrible utf-8 performace in wc
> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
>                        linepos += width;
>                      if (iswspace (wide_char))
>                        goto mb_word_separator;
> +                    else if (uc_combining_class (wide_char) != 0)
> +                      chars--; /* don't count combining chars */
>                      in_word = true;
>                    }
>                  break;

If you want a tool to ignore combining characters (not 'wc -m', since 'wc -m' is not specified to behave like this, see the other mail), then uc_combining_class from gnulib is a usable API.

However, in this patch you are assuming a UTF-8 locale. Recall that on some systems (Solaris, FreeBSD, ...), in the EUC-JP locale for example, the wide-character representation of a double-byte character is unrelated to Unicode: the mbrtowc routine just combines the two bytes into a single wchar_t with a bit of shifting and masking; no conversion to Unicode takes place here.

If you want to convert a byte sequence from the locale's encoding to a sequence of Unicode characters, in order to use uc_combining_class and similar APIs, you can do so through the gnulib function u32_conv_from_encoding (using locale_charset() as the encoding). It's defined in gnulib's uniconv.h file.

Bruno
Re: locales for testing
Jim Meyering wrote:
> you'll need to include the new test only if there is sufficient multi-byte support and if you can find a suitable locale to test with.

gnulib has a few autoconf macros to determine suitable locales:

  gt_LOCALE_FR_UTF8 - French locale with UTF-8 encoding
      - Use this to verify basic operation in UTF-8 locales.
  gt_LOCALE_TR_UTF8 - Turkish locale with UTF-8 encoding
      - Use this to verify upcase/downcase operations.
  gt_LOCALE_FR      - French locale with unibyte encoding
      - Use this to verify classical unibyte locales.
  gt_LOCALE_ZH_CN   - Chinese locale with GB18030 encoding
      - Use this to verify operation in locales which have Unicode characters but don't use UTF-8.

Bruno
Re: BugReport about ln command worked in NTFS
[ re-adding bug-coreutils@gnu.org ]

On Thu, 8 May 2008, [EMAIL PROTECTED] wrote:
> The complete log about running ln is in the attachment.

The strace -c output you posted shows 1 successful call to link(2), as I'd expect. The stat(1) output that follows shows, also as expected, that the link count is 2 for both filenames.

Your initial report stated that rm was failing to remove one of the links, but your sample output doesn't show any use of rm, so it's impossible to see the problem being demonstrated. Please try running the following commands on the affected filesystem and send back the output:

    $ touch test1
    $ ln test1 test2
    $ ls -l
    $ strace -e trace=unlink rm test1
    $ ls -l

Cheers,
Phil
Re: horrible utf-8 performace in wc
Bruno Haible wrote:
> If you want wc to count characters after canonicalization, then you can invent a new wc command-line option for it. But I would find it more useful to have a filter program that reads from standard input and writes the canonicalized output to standard output; that would be applicable in many more situations.

I like the sound of that! I suppose the not-yet-implemented gnulib Unicode normalization library you mentioned in another post would be a prerequisite for such a tool. I'm definitely interested in helping out here, but I think someone with a more thorough understanding of Unicode would probably be more useful (Pádraig?)

Bo
Re: horrible utf-8 performace in wc
Bruno Haible wrote:
> As a consequence:
>   - The number of characters is the same as the number of wide characters.
>   - wc -m must output the number of characters.
>   - In a Unicode locale, U+00E9 is one character, and U+0065 U+0301 is two characters,

Fair enough.

> If you want wc to count characters after canonicalization, then you can invent a new wc command-line option for it.

I guess one could possibly have --chars={unicode,glyph,grapheme,column}, with unicode being the default and matching how it currently works.

> But I would find it more useful to have a filter program that reads from standard input and writes the canonicalized output to standard output; that would be applicable in many more situations.

That would be _very_ useful, yes.

Thanks for all the great info in this thread,
Pádraig.
Re: horrible utf-8 performace in wc
> $ time ./wc -m long_lines.txt
> 13357046 long_lines.txt
> real    0m1.860s

It processes at the speed of 7 million characters per second. I would not call this a horrible performance.

> However wc calls mbrtowc() for each multibyte character.

Yes. One could use mbstowcs (or mbsnrtowcs, but that exists in glibc only). Or one can avoid the calls to mbrtowc() when the character is in the basic POSIX character set (i.e. most of ASCII). This trick comes from Paul Eggert and is already implemented in gnulib's mbiter.h and mbswidth.c. Applied here, it hardly changes the code but speeds it up by a factor of 3.

Timing with original coreutils-6.11:

    $ time wc -w SuSE-9.0-DVD-ARCHIVES
    6999399
    real    2m26.211s
    user    2m8.553s
    sys     0m1.046s
    $ time wc -m SuSE-9.0-DVD-ARCHIVES
    120602576
    real    2m17.754s
    user    2m8.164s
    sys     0m0.919s

Timing with this patch:

    $ time /build/coreutils-6.11/src/wc -w SuSE-9.0-DVD-ARCHIVES
    6999399
    real    0m42.101s
    user    0m40.179s
    sys     0m0.875s
    $ time /build/coreutils-6.11/src/wc -m SuSE-9.0-DVD-ARCHIVES
    120602576
    real    0m41.609s
    user    0m40.171s
    sys     0m0.908s

So the resulting counts are the same, and the time to process a 120 MB file is reduced from 128 sec to 40 sec, i.e. the speed increases from 0.94 MB/sec to 3.0 MB/sec.

2008-05-08  Bruno Haible  [EMAIL PROTECTED]

        Speed up wc -m and wc -w in multibyte case.
        * src/wc.c: Include mbchar.h.
        (wc): New variable in_shift. Use it to avoid calling mbrtowc for
        most ASCII characters.

*** coreutils-6.11/src/wc.c.bak 2008-04-19 23:34:23.0 +0200
--- coreutils-6.11/src/wc.c 2008-05-08 16:18:25.0 +0200
***************
*** 1,5 ****
  /* wc - print the number of lines, words, and bytes in files
!    Copyright (C) 85, 91, 1995-2007 Free Software Foundation, Inc.

     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
--- 1,5 ----
  /* wc - print the number of lines, words, and bytes in files
!    Copyright (C) 85, 91, 1995-2008 Free Software Foundation, Inc.

     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
***************
*** 28,33 ****
--- 28,34 ----
  #include "system.h"
  #include "error.h"
  #include "inttostr.h"
+ #include "mbchar.h"
  #include "quote.h"
  #include "readtokens0.h"
  #include "safe-read.h"
***************
*** 274,279 ****
--- 275,281 ----
    bool in_word = false;
    uintmax_t linepos = 0;
    mbstate_t state = { 0, };
+   bool in_shift = false;
  # if SUPPORT_OLD_MBRTOWC
    /* Back-up the state before each multibyte character conversion and
       move the last incomplete character of the buffer to the front
***************
*** 308,377 ****
                wchar_t wide_char;
                size_t n;
! # if SUPPORT_OLD_MBRTOWC
!               backup_state = state;
! # endif
!               n = mbrtowc (&wide_char, p, bytes_read, &state);
!               if (n == (size_t) -2)
                  {
! # if SUPPORT_OLD_MBRTOWC
!                   state = backup_state;
! # endif
!                   break;
!                 }
!               if (n == (size_t) -1)
!                 {
!                   /* Remember that we read a byte, but don't complain
!                      about the error.  Because of the decoding error,
!                      this is a considered to be byte but not a
!                      character (that is, chars is not incremented).  */
!                   p++;
!                   bytes_read--;
                  }
                else
                  {
                    if (n == 0)
                      {
                        wide_char = 0;
                        n = 1;
                      }
!                   p += n;
!                   bytes_read -= n;
!                   chars++;
!                   switch (wide_char)
                      {
!                     case '\n':
!                       lines++;
!                       /* Fall through. */
!                     case '\r':
!                     case '\f':
!                       if (linepos > linelength)
!                         linelength = linepos;
!                       linepos = 0;
!                       goto mb_word_separator;
!                     case '\t':
!                       linepos += 8 - (linepos % 8);
!                       goto mb_word_separator;
!                     case ' ':
!                       linepos++;
!                       /* Fall through. */
!                     case '\v':
!                     mb_word_separator:
!                       words += in_word;
!                       in_word = false;
!                       break;
!                     default:
!                       if (iswprint (wide_char))
!                         {
!                           int width = wcwidth (wide_char);
!                           if (width > 0)
!                             linepos += width;
!                           if (iswspace (wide_char))
!
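The patch is cut off in this archive, so the fast path it describes is sketched here independently. This is not the applied patch: count_chars_fast and is_basic_byte are hypothetical names, is_basic_byte is only a rough stand-in for gnulib's is_basic() from mbchar.h, and the sketch assumes an ASCII-compatible locale encoding. It shows the idea of skipping mbrtowc for basic characters while the conversion state is initial, and of tracking that state in an in_shift flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Rough stand-in for gnulib's is_basic(): true for plain ASCII text
   bytes (TAB..CR and SPACE..'~').  Assumes an ASCII-compatible
   encoding; hypothetical helper, for illustration only.  */
static bool
is_basic_byte (char c)
{
  unsigned char u = (unsigned char) c;
  return (u >= '\t' && u <= '\r') || (u >= ' ' && u <= '~');
}

/* Count characters in buf[0..len) under the current LC_CTYPE locale,
   calling mbrtowc only when the fast path does not apply.  */
size_t
count_chars_fast (const char *buf, size_t len)
{
  mbstate_t state;
  bool in_shift = false;        /* true while state is not initial */
  size_t chars = 0;

  assert (buf != NULL);
  memset (&state, 0, sizeof state);
  while (len > 0)
    {
      size_t n;

      if (!in_shift && is_basic_byte (*buf))
        n = 1;                  /* fast path: one byte, one character */
      else
        {
          wchar_t wc;

          n = mbrtowc (&wc, buf, len, &state);
          if (n == (size_t) -2)
            break;              /* incomplete character at end of buffer */
          if (n == (size_t) -1)
            {
              /* Invalid byte: reset the state, skip it, do not count it.  */
              memset (&state, 0, sizeof state);
              in_shift = false;
              buf++;
              len--;
              continue;
            }
          if (n == 0)
            n = 1;              /* an embedded NUL is one character */
          in_shift = !mbsinit (&state);
        }
      buf += n;
      len -= n;
      chars++;
    }
  return chars;
}
```

The in_shift flag matters for stateful encodings: once mbrtowc has left the initial state, an ASCII-looking byte may belong to a shifted sequence, so the fast path is only taken again after mbsinit reports the initial state.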
Re: horrible utf-8 performace in wc
Bruno Haible [EMAIL PROTECTED] wrote:
> 2008-05-08  Bruno Haible  [EMAIL PROTECTED]
>
>         Speed up wc -m and wc -w in multibyte case.
>         * src/wc.c: Include mbchar.h.
>         (wc): New variable in_shift. Use it to avoid calling mbrtowc for
>         most ASCII characters.

Thanks! I've applied that.