feature request: error codes for 'rm'

2008-05-08 Thread Danny Rawlins
Hi, I'm quite surprised that 'rm' does not return an error code when there
is no such file. I would like to see at least error code 1 so I can use it
in a shell script; additional error codes might also be nice.


Regards,
Danny Rawlins
Romster
http://crux.nu/Public/DannyRawlins


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: feature request: error codes for 'rm'

2008-05-08 Thread Danny Rawlins

Danny Rawlins wrote:
Hi, I'm quite surprised that 'rm' does not return an error code when there
is no such file. I would like to see at least error code 1 so I can use it
in a shell script; additional error codes might also be nice.


Regards,
Danny Rawlins
Romster
http://crux.nu/Public/DannyRawlins

Sorry, I made a mistake and did not check properly. Sorry for wasting
your time.


Regards,
Danny Rawlins




Problem under Linux

2008-05-08 Thread nel natou
Hello.
  I installed Ubuntu on my PC in dual boot with Windows, and I have
problems with the screen or the refresh rate.
  What is strange is that it used to affect only Linux; now it happens
under Windows too, and sometimes it lasts a very long time.
  I don't know what to do.
  I installed Debian instead of Ubuntu and the problem continues.
  What should I do?



Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
Pádraig Brady wrote:
 mbstowcs doesn't canonicalize equivalent multibyte sequences,
 and so therefore functions the same in this regard as our
 processing of each wide character separately.
 This could be considered a bug actually- i.e. should -m give
 the number of wide chars, or the number of multibyte chars?
 With the attached patch, `wc -m` gives 23 chars for both these lines.

The behaviour of wc -m is specified by POSIX [1] to output the number
of characters. And:
  LC_CTYPE
      Determine the locale for the interpretation of sequences of bytes of
      text data as characters (for example, single-byte as opposed to
      multi-byte characters in arguments and input files) and which
      characters are defined as white space characters.

The definition of Character in [2] means a multibyte-character. IMO it
cannot be interpreted to mean a glyph, or a grapheme cluster, or a screen
column. Rather, it is the unit that is processed by a call to mbtowc [3] or
mbrtowc [4].

As a consequence:
  - The number of characters is the same as the number of wide characters.
  - wc -m must output the number of characters.
  - In a Unicode locale, U+00E9 is one character, and U+0065 U+0301 is
    two characters,
      * even if they are canonically equivalent (because POSIX does not make
        reference to this concept), and
      * even if they render the same on the screen (because except for Curses,
        POSIX does not refer to the rendering of characters).
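A minimal sketch of this counting rule, assuming that undecodable bytes are skipped without being counted (the helper name count_chars is hypothetical and is not part of wc): one character is counted for each successful mbrtowc call in the current LC_CTYPE locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count characters in the POSIX sense: one per successful mbrtowc() call.
   Bytes that do not decode in the current LC_CTYPE locale are skipped
   and not counted; that choice is an assumption of this sketch.  */
size_t
count_chars (const char *s, size_t len)
{
  mbstate_t state;
  size_t chars = 0;

  memset (&state, 0, sizeof state);
  while (len > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, len, &state);

      if (n == (size_t) -2)
        break;                  /* incomplete character at end of input */
      if (n == (size_t) -1)
        {
          /* Invalid byte: reset the state, skip it, do not count it.  */
          memset (&state, 0, sizeof state);
          s++;
          len--;
          continue;
        }
      if (n == 0)
        n = 1;                  /* a null character still occupies one byte */
      s += n;
      len -= n;
      chars++;
    }
  return chars;
}
```

After a successful setlocale to a UTF-8 locale, the two bytes "\xc3\xa9" (U+00E9) would be expected to count as one character and the three bytes "e\xcc\x81" (U+0065 U+0301) as two, since no canonicalization takes place.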

If you want wc to count characters after canonicalization, then you can
invent a new wc command-line option for it. But I would find it more useful
to have a filter program that reads from standard input and writes the
canonicalized output to standard output; that would be applicable in many
more situations.

Bruno

[1] http://www.opengroup.org/susv3/utilities/wc.html
[2] http://www.opengroup.org/susv3/basedefs/xbd_chap03.html
[3] http://www.opengroup.org/susv3/functions/mbtowc.html
[4] http://www.opengroup.org/susv3/functions/mbrtowc.html





Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
 Is there a good library for combining-character canonicalization
 available?  That seems like something that would be useful to have in a
 lot of text-processing tools.  Also, for Unicode, something to shuffle
 between the normalization forms might be helpful for comparisons.

Such functionality is currently available in IBM's ICU, in GNOME's
libunicode, and in Simon's libidn, and should become available in gnulib
in some time. Please contact me if you want to help with the gnulib
implementation.

Bruno





Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
 @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
                        linepos += width;
                      if (iswspace (wide_char))
                        goto mb_word_separator;
 +                    else if (uc_combining_class (wide_char) != 0)
 +                      chars--; /* don't count combining chars */
                      in_word = true;
                    }
                  break;

If you want a tool to ignore combining characters (not 'wc -m', since 'wc -m'
is not specified to behave like this, see the other mail), then
uc_combining_class from gnulib is a usable API.

However, in this patch you are assuming a UTF-8 locale. Recall that on some
systems (Solaris, FreeBSD, ...) in EUC-JP locale for example, the wide-character
representation of a double-byte character is unrelated to Unicode: the mbrtowc
routine just combines the two bytes in a single wchar_t with a bit of shifting
and masking; no conversion to Unicode takes place here.

If you want to convert a byte sequence from the locale's encoding to a
sequence of Unicode characters, in order to use uc_combining_class and similar
API, you can do so through the gnulib function u32_conv_from_encoding
(using locale_charset() as encoding). It's defined in gnulib's uniconv.h file.

Bruno





Re: locales for testing

2008-05-08 Thread Bruno Haible
Jim Meyering wrote:
 you'll need to include the new test only if there is
 sufficient multi-byte support and if you can find a suitable locale to
 test with.

gnulib has a few autoconf macros to determine suitable locales:

  gt_LOCALE_FR_UTF8   - French locale with UTF-8 encoding
                      - Use this to verify basic operation in UTF-8 locales.

  gt_LOCALE_TR_UTF8   - Turkish locale with UTF-8 encoding
                      - Use this to verify upcase/downcase operations.

  gt_LOCALE_FR        - French locale with unibyte encoding
                      - Use this to verify classical unibyte locales.

  gt_LOCALE_ZH_CN     - Chinese locale with GB18030 encoding
                      - Use this to verify operation in locales which have
                        Unicode characters but don't use UTF-8.

Bruno





Re: BugReport about ln command worked in NTFS

2008-05-08 Thread Philip Rowlands

[ re-adding bug-coreutils@gnu.org ]

On Thu, 8 May 2008, [EMAIL PROTECTED] wrote:


The complete log about running ln is in the attachment.


The strace -c output you posted shows 1 successful call to link(2), as 
I'd expect. It then shows further expected output from stat(1) that the 
link count is 2 for both filenames.


Your initial report stated that rm was failing to remove one of the 
links, but your sample output doesn't show any use of rm, so it's 
impossible to see the problem being demonstrated.


Please try running the following commands on the affected filesystem and 
send back the output:


$ touch test1
$ ln test1 test2
$ ls -l
$ strace -e trace=unlink rm test1
$ ls -l
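The same check can also be sketched at the system-call level. This is only an illustration, assuming a filesystem that supports hard links; the helper names link_count and check_unlink are hypothetical. It creates a file, links it, removes the first name, and records the link count of the second name before and after.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the hard-link count of PATH, or -1 if stat() fails.  */
long
link_count (const char *path)
{
  struct stat st;

  if (stat (path, &st) != 0)
    return -1;
  return (long) st.st_nlink;
}

/* Create NAME1, hard-link it to NAME2, then unlink NAME1.
   Stores the link count of NAME2 before and after the unlink.
   Returns 0 on success, -1 if any step fails.  */
int
check_unlink (const char *name1, const char *name2,
              long *before, long *after)
{
  FILE *fp = fopen (name1, "w");      /* like: touch test1 */

  if (fp == NULL)
    return -1;
  fclose (fp);
  if (link (name1, name2) != 0)       /* like: ln test1 test2 */
    {
      unlink (name1);
      return -1;
    }
  *before = link_count (name2);
  if (unlink (name1) != 0)            /* like: rm test1 */
    return -1;
  *after = link_count (name2);
  return 0;
}
```

On a filesystem that behaves as expected, the count should drop from 2 to 1 and the first name should no longer exist.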


Cheers,
Phil




Re: horrible utf-8 performace in wc

2008-05-08 Thread Bo Borgerson
Bruno Haible wrote:
 If you want wc to count characters after canonicalization, then you can
 invent a new wc command-line option for it. But I would find it more useful
 to have a filter program that reads from standard input and writes the
 canonicalized output to standard output; that would be applicable in many
 more situations.


I like the sound of that!

I suppose the not-yet-implemented gnulib Unicode normalization library
you mentioned in another post would be a prerequisite for such a tool.

I'm definitely interested in helping out here, but I think someone with
a more thorough understanding of Unicode would probably be more useful
(Pádraig?)

Bo




Re: horrible utf-8 performace in wc

2008-05-08 Thread Pádraig Brady
Bruno Haible wrote:
 As a consequence:
   - The number of characters is the same as the number of wide characters.
   - wc -m must output the number of characters.
   - In a Unicode locale, U+00E9 is one character, and U+0065 U+0301 is
     two characters,

Fair enough.

 If you want wc to count characters after canonicalization, then you can
 invent a new wc command-line option for it.

I guess one could possibly have --chars={unicode,glyph,grapheme,column}
with unicode being the default, which is how it currently works.

 But I would find it more useful
 to have a filter program that reads from standard input and writes the
 canonicalized output to standard output; that would be applicable in many
 more situations.

That would be _very_ useful, yes.

thanks for all the great info in this thread,
Pádraig.





Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
 $ time ./wc -m long_lines.txt
 13357046 long_lines.txt
 real    0m1.860s

It processes at the speed of 7 million characters per second. I would not call
this a horrible performance.

 However wc calls mbrtowc() for each multibyte character.

Yes. One could use mbstowcs (or mbsnrtowcs, but that exists in glibc only).
Or one can avoid the calls to mbrtowc() when the character is in the basic
POSIX character set (i.e. most of ASCII). This trick comes from Paul Eggert
and is already realized in gnulib's mbiter.h and mbswidth.c. Applied here,
it hardly changes the code but speeds it up by a factor of 3.
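The idea can be sketched roughly as follows. This is not the patch itself: is_basic_byte is a hypothetical stand-in for gnulib's is_basic, approximated here as the ASCII members of the ISO C basic character set, and count_chars_fast is an illustrative name. The in_shift flag keeps the fast path off while mbrtowc is in the middle of a shift sequence.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for gnulib's is_basic(): printable ASCII except
   '$', '@' and '`', plus the whitespace controls \t \n \v \f \r.  */
static bool
is_basic_byte (unsigned char c)
{
  if (c == '$' || c == '@' || c == '`')
    return false;
  return (c >= 0x20 && c <= 0x7e) || (c >= '\t' && c <= '\r');
}

/* Count characters, calling mbrtowc() only for non-basic bytes.  */
size_t
count_chars_fast (const char *s, size_t len)
{
  mbstate_t state;
  bool in_shift = false;        /* true while mbrtowc() is mid-sequence */
  size_t chars = 0;

  memset (&state, 0, sizeof state);
  while (len > 0)
    {
      size_t n;

      if (!in_shift && is_basic_byte ((unsigned char) *s))
        n = 1;                  /* fast path: one byte, one character */
      else
        {
          wchar_t wc;

          in_shift = true;
          n = mbrtowc (&wc, s, len, &state);
          if (n == (size_t) -2)
            break;              /* incomplete character at end of input */
          if (n == (size_t) -1)
            {
              /* Invalid byte: skip it without counting a character.  */
              memset (&state, 0, sizeof state);
              in_shift = false;
              s++;
              len--;
              continue;
            }
          if (n == 0)
            n = 1;              /* a null character occupies one byte */
          if (mbsinit (&state))
            in_shift = false;   /* back in the initial shift state */
        }
      s += n;
      len -= n;
      chars++;
    }
  return chars;
}
```

For mostly-ASCII input nearly every byte takes the first branch, which is where the saving would come from; the counts should match those of a loop that always calls mbrtowc.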

Timing with original coreutils-6.11:
$ time wc -w < SuSE-9.0-DVD-ARCHIVES
6999399

real    2m26.211s
user    2m8.553s
sys     0m1.046s
$ time wc -m < SuSE-9.0-DVD-ARCHIVES
120602576

real    2m17.754s
user    2m8.164s
sys     0m0.919s

Timing with this patch:
$ time /build/coreutils-6.11/src/wc -w < SuSE-9.0-DVD-ARCHIVES
6999399

real    0m42.101s
user    0m40.179s
sys     0m0.875s
$ time /build/coreutils-6.11/src/wc -m < SuSE-9.0-DVD-ARCHIVES
120602576

real    0m41.609s
user    0m40.171s
sys     0m0.908s

So the resulting counts are the same, and the user time to process a 120 MB
file is reduced from 128 sec to 40 sec, i.e. the speed increases from
0.94 MB/sec to 3.0 MB/sec.


2008-05-08  Bruno Haible  [EMAIL PROTECTED]

Speed up wc -m and wc -w in multibyte case.
* src/wc.c: Include mbchar.h.
(wc): New variable in_shift. Use it to avoid calling mbrtowc for most
ASCII characters.

*** coreutils-6.11/src/wc.c.bak 2008-04-19 23:34:23.0 +0200
--- coreutils-6.11/src/wc.c 2008-05-08 16:18:25.0 +0200
***************
*** 1,5 ****
  /* wc - print the number of lines, words, and bytes in files
!    Copyright (C) 85, 91, 1995-2007 Free Software Foundation, Inc.
  
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
--- 1,5 ----
  /* wc - print the number of lines, words, and bytes in files
!    Copyright (C) 85, 91, 1995-2008 Free Software Foundation, Inc.
  
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
***************
*** 28,33 ****
--- 28,34 ----
  #include "system.h"
  #include "error.h"
  #include "inttostr.h"
+ #include "mbchar.h"
  #include "quote.h"
  #include "readtokens0.h"
  #include "safe-read.h"
***************
*** 274,279 ****
--- 275,281 ----
        bool in_word = false;
        uintmax_t linepos = 0;
        mbstate_t state = { 0, };
+       bool in_shift = false;
  # if SUPPORT_OLD_MBRTOWC
        /* Back-up the state before each multibyte character conversion and
           move the last incomplete character of the buffer to the front
***************
*** 308,377 ****
                wchar_t wide_char;
                size_t n;
  
! # if SUPPORT_OLD_MBRTOWC
!               backup_state = state;
! # endif
!               n = mbrtowc (&wide_char, p, bytes_read, &state);
!               if (n == (size_t) -2)
                  {
! # if SUPPORT_OLD_MBRTOWC
!                   state = backup_state;
! # endif
!                   break;
!                 }
!               if (n == (size_t) -1)
!                 {
!                   /* Remember that we read a byte, but don't complain
!                      about the error.  Because of the decoding error,
!                      this is a considered to be byte but not a
!                      character (that is, chars is not incremented).  */
!                   p++;
!                   bytes_read--;
                  }
                else
                  {
                    if (n == 0)
                      {
                        wide_char = 0;
                        n = 1;
                      }
!                   p += n;
!                   bytes_read -= n;
!                   chars++;
!                   switch (wide_char)
                      {
!                     case '\n':
!                       lines++;
!                       /* Fall through. */
!                     case '\r':
!                     case '\f':
!                       if (linepos > linelength)
!                         linelength = linepos;
!                       linepos = 0;
!                       goto mb_word_separator;
!                     case '\t':
!                       linepos += 8 - (linepos % 8);
!                       goto mb_word_separator;
!                     case ' ':
!                       linepos++;
!                       /* Fall through. */
!                     case '\v':
!                     mb_word_separator:
!                       words += in_word;
!                       in_word = false;
!                       break;
!                     default:
!                       if (iswprint (wide_char))
!                         {
!                           int width = wcwidth (wide_char);
!                           if (width > 0)
!                             linepos += width;
!                           if (iswspace (wide_char))
!

Re: horrible utf-8 performace in wc

2008-05-08 Thread Jim Meyering
Bruno Haible [EMAIL PROTECTED] wrote:
 2008-05-08  Bruno Haible  [EMAIL PROTECTED]

   Speed up wc -m and wc -w in multibyte case.
   * src/wc.c: Include mbchar.h.
   (wc): New variable in_shift. Use it to avoid calling mbrtowc for most
   ASCII characters.

Thanks!
I've applied that.

