Re: cut -DF

2022-01-25 Thread Assaf Gordon

Hello,

Here's an updated patch for "cut -DF".
Since it's a new code path, it opens the possibility of finally 
supporting multibyte characters with "cut -c".



Comments are very welcome,
 - assaf

 [PATCH 01/18] cut: set-fields: add no-sort options
 [PATCH 02/18] cut: initial -D implementation, currently only with "-f"
 [PATCH 03/18] tests: add 'cut -D' tests
 [PATCH 04/18] cut: extract 'cut -D -f' to a separate function
 [PATCH 05/18] cut: implement -D with -b
 [PATCH 06/18] tests: add 'cut -D -b' tests
 [PATCH 07/18] cut: add -O short-option for --output-delimiter
 [PATCH 08/18] cut: implement -F
 [PATCH 09/18] tests: add 'cut -F' tests
 [PATCH 10/18] cut: extract cut-fields into separate functions
 [PATCH 11/18] cut: implement multibyte -c/--characters
 [PATCH 12/18] cut: change -F regex syntax to BRE
 [PATCH 13/18] cut: change -D long-option equivalent
 [PATCH 14/18] doc: mention 'cut -D' in NEWS
 [PATCH 15/18] doc: mention 'cut -F' in NEWS
 [PATCH 16/18] doc: mention 'cut -O' in NEWS
 [PATCH 17/18] doc: mention multibyte 'cut -c' in NEWS
 [PATCH 18/18] doc: expand 'cut' section
From 2557ced8cb30655ef55c8532d814798172b5c392 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Wed, 5 Jan 2022 13:03:39 -0700
Subject: [PATCH 01/18] cut: set-fields: add no-sort options

---
 src/set-fields.c | 27 +++
 src/set-fields.h |  4 +++-
 2 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/src/set-fields.c b/src/set-fields.c
index e3cce30d9..5e4ee6715 100644
--- a/src/set-fields.c
+++ b/src/set-fields.c
@@ -279,22 +279,25 @@ set_fields (char const *fieldstr, unsigned int options)
          ? _("missing list of byte/character positions")
          : _("missing list of fields"));
 
-  qsort (frp, n_frp, sizeof (frp[0]), compare_ranges);
-
-  /* Merge range pairs (e.g. `2-5,3-4' becomes `2-5'). */
-  for (size_t i = 0; i < n_frp; ++i)
+  if (!(options & SETFLD_NO_SORT))
     {
-      for (size_t j = i + 1; j < n_frp; ++j)
+      qsort (frp, n_frp, sizeof (frp[0]), compare_ranges);
+
+      /* Merge range pairs (e.g. `2-5,3-4' becomes `2-5'). */
+      for (size_t i = 0; i < n_frp; ++i)
         {
-          if (frp[j].lo <= frp[i].hi)
+          for (size_t j = i + 1; j < n_frp; ++j)
             {
-              frp[i].hi = MAX (frp[j].hi, frp[i].hi);
-              memmove (frp + j, frp + j + 1, (n_frp - j - 1) * sizeof *frp);
-              n_frp--;
-              j--;
+              if (frp[j].lo <= frp[i].hi)
+                {
+                  frp[i].hi = MAX (frp[j].hi, frp[i].hi);
+                  memmove (frp + j, frp + j + 1, (n_frp - j - 1) * sizeof *frp);
+                  n_frp--;
+                  j--;
+                }
+              else
+                break;
             }
-          else
-            break;
         }
     }
 
diff --git a/src/set-fields.h b/src/set-fields.h
index 7bc9b3afe..9127d9957 100644
--- a/src/set-fields.h
+++ b/src/set-fields.h
@@ -34,8 +34,10 @@ enum
 {
   SETFLD_ALLOW_DASH = 0x01, /* allow single dash meaning 'all fields' */
   SETFLD_COMPLEMENT = 0x02, /* complement the field list */
-  SETFLD_ERRMSG_USE_POS = 0x04  /* when reporting errors, say 'position' instead
+  SETFLD_ERRMSG_USE_POS = 0x04, /* when reporting errors, say 'position' instead
of 'field' (used with cut -b/-c) */
+  SETFLD_NO_SORT= 0x08  /* Do not sort the fields; keep duplicated
+   and overlapped fields */
 };
 
 /* allocates and initializes the FRP array and N_FRP count */
-- 
2.30.2
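
For readers following the SETFLD_NO_SORT change, here is a minimal
standalone sketch of the behaviour that the hunk above makes optional.
The "struct range" type and the normalize_ranges() name are hypothetical
stand-ins, not the set-fields.c API; only the sort-then-merge logic is
meant to mirror the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the range pairs kept by set-fields.c.  */
struct range { size_t lo, hi; };

static int
compare_ranges (const void *a, const void *b)
{
  const struct range *x = a, *y = b;
  return (x->lo > y->lo) - (x->lo < y->lo);
}

/* Sort R[0..N-1] by start and merge overlapping ranges in place,
   unless NO_SORT is set, in which case the list is left exactly as
   given (order, duplicates and overlaps preserved).
   Return the resulting number of ranges.  */
static size_t
normalize_ranges (struct range *r, size_t n, bool no_sort)
{
  assert (r != NULL || n == 0);
  if (no_sort || n == 0)
    return n;

  qsort (r, n, sizeof r[0], compare_ranges);

  for (size_t i = 0; i < n; ++i)
    {
      /* While the next range overlaps range I, absorb it and
         shift the remaining ranges down by one.  */
      size_t j = i + 1;
      while (j < n && r[j].lo <= r[i].hi)
        {
          if (r[j].hi > r[i].hi)
            r[i].hi = r[j].hi;
          memmove (r + j, r + j + 1, (n - j - 1) * sizeof r[0]);
          n--;
        }
    }
  return n;
}
```

With the list 3-4,2-5,7 the default path yields the two ranges 2-5 and
7, while the no-sort path returns the three entries untouched, which is
what lets a later stage output fields in the order they were typed.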

From 6db6c47aabe5c0ba194cecb1f8f24957b65e1556 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Wed, 5 Jan 2022 13:04:08 -0700
Subject: [PATCH 02/18] cut: initial -D implementation, currently only with
 "-f"

---
 src/cut.c | 161 --
 1 file changed, 156 insertions(+), 5 deletions(-)

diff --git a/src/cut.c b/src/cut.c
index 5143c8bd9..84caad091 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -20,7 +20,9 @@
 /* POSIX changes, bug fixes, long-named options, and cleanup
by David MacKenzie .
 
-   Rewrite cut_fields and cut_bytes -- Jim Meyering.  */
+   Rewrite cut_fields and cut_bytes -- Jim Meyering.
+
+   Match toybox's -D,-F,-O options -- Assaf Gordon. */
 
 #include <config.h>
 
@@ -43,7 +45,8 @@
 #define AUTHORS \
   proper_name ("David M. Ihnat"), \
   proper_name ("David MacKenzie"), \
-  proper_name ("Jim Meyering")
+  proper_name ("Jim Meyering"), \
+  proper_name ("Assaf Gordon")
 
 #define FATAL_ERROR(Message)		\
   do	\
@@ -113,6 +116,15 @@ static char *output_delimiter_string;
 /* True if we have ever read standard input. */
 static bool have_read_stdin;
 
+/* If true use different (but less optimized) code,
+   Used with -F and/or -D.  */
+static bool adv_mode;
+
+/* True if -D is used: allow duplicate

Re: Compilations warnings-as-errors when building from git

2022-01-13 Thread Assaf Gordon

follow-up:

On 2022-01-13 11:22 p.m., Assaf Gordon wrote:
I'm getting a few warnings-as-errors when building the latest version from
git (using Debian 10 amd64 with gcc 8.3.0).


with clang-14 ( Debian clang version 
14.0.0-++20211220125923+c79a67196828-1~exp1~20211220130019.184 )


I'm seeing the following warnings:

---

src/uptime.c:75:47: warning: implicit conversion from 'time_t' (aka 'long') to 'double' changes value from 9223372036854775807 to 9223372036854775808 [-Wimplicit-const-int-float-conversion]
uptime = (0 <= upsecs && upsecs < TYPE_MAXIMUM (time_t)
~ ^
./lib/intprops.h:57:4: note: expanded from macro 'TYPE_MAXIMUM'
  ((t) (! TYPE_SIGNED (t)   \
---

src/ls.c:2287:33: warning: result of comparison of constant 9223372036854775807 with expression of type 'unsigned short' is always true [-Wtautological-constant-out-of-range-compare]
linelen = ws.ws_col <= MIN (PTRDIFF_MAX, SIZE_MAX) ? ws.ws_col : 0;
  ~ ^  ~~~
1 warning generated.



src/sort.c:1414:21: warning: implicit conversion from 'unsigned long' to 'double' changes value from 18446744073709551615 to 18446744073709551616 [-Wimplicit-const-int-float-conversion]
  if (mem < UINTMAX_MAX)
  ~ ^~~
/usr/include/stdint.h:202:24: note: expanded from macro 'UINTMAX_MAX'
# define UINTMAX_MAX (__UINT64_C(18446744073709551615))
 ^~~~
/usr/include/stdint.h:107:25: note: expanded from macro '__UINT64_C'
#  define __UINT64_C(c) c ## UL
^~~
<scratch space>:21:1: note: expanded from here
18446744073709551615UL
^~
1 warning generated.

-
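
To make the uptime.c and sort.c warnings concrete, here is a small
sketch of the rounding clang is pointing at, assuming IEEE-754 doubles
with the default round-to-nearest mode. The function names are
hypothetical and this is not a proposed fix for coreutils; it only
shows why comparing a double against an integer maximum is off by one
and how a comparison against an exactly representable power of two
avoids the implicit conversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* INT64_MAX (2^63 - 1) has no exact double representation; converting
   it rounds up to 2^63.  This self-check documents that assumption.  */
static void
check_int64_max_rounding (void)
{
  volatile int64_t m = INT64_MAX;
  assert ((double) m == 9223372036854775808.0);
}

/* Return true if D is in the range [0, 2^63), i.e. if truncating it to
   a 64-bit signed integer cannot overflow.  The bound is written as a
   floating-point constant (2^63 is exactly representable), so no
   integer-to-double conversion takes place.  */
static bool
fits_nonneg_int64 (double d)
{
  return 0 <= d && d < 9223372036854775808.0;
}
```

Note that a NaN argument fails the first comparison, so
fits_nonneg_int64() returns false for it.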

Also, when compiling gnulib modules there is this:

warning: unknown warning option '-Wno-unsuffixed-float-constants' 
[-Wunknown-warning-option]


Which, I see was removed from gnulib in 2011,
and reinstated just now in
https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=0c8a563f65d44752b33aec42cceec25bd485f2d5

---



Compilations warnings-as-errors when building from git

2022-01-13 Thread Assaf Gordon

Hi all,

I'm getting a few warnings-as-errors when building the latest version from
git (using Debian 10 amd64 with gcc 8.3.0).
I can send a patch for the "malloc" one, but I'm not sure about the
intricacies of intprops.h.


- assaf



lib/randperm.c: In function 'sparse_new':
lib/randperm.c:111:1: error: function might be candidate for attribute 'malloc' if it is known to return normally [-Werror=suggest-attribute=malloc]
 sparse_new (size_t size_hint)
 ^~

src/stat.c: In function 'default_format':
src/stat.c:1653:1: error: function might be candidate for attribute 'malloc' if it is known to return normally [-Werror=suggest-attribute=malloc]
 default_format (bool fs, bool terse, bool device)
 ^~
cc1: all warnings being treated as errors



( This failure is in pinky.c and the same in csplit.c )

In file included from ./lib/xalloc.h:27,
                 from src/system.h:244,
                 from src/pinky.c:25:
src/pinky.c: In function 'create_fullname':
./lib/intprops.h:44:55: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
 #define EXPR_SIGNED(e) (_GL_INT_NEGATE_CONVERT (e, 1) < 0)
   ^
./lib/intprops.h:407:42: note: in expansion of macro 'EXPR_SIGNED'
 ((!_GL_SIGNED_TYPE_OR_EXPR (*(r)) && EXPR_SIGNED (a) && EXPR_SIGNED (b) \
  ^~~
src/pinky.c:115:11: note: in expansion of macro 'INT_MULTIPLY_WRAPV'
   if (INT_MULTIPLY_WRAPV (ulen, ampersands - 1, )
   ^~
./lib/intprops.h:44:55: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
 #define EXPR_SIGNED(e) (_GL_INT_NEGATE_CONVERT (e, 1) < 0)
   ^
./lib/intprops.h:407:61: note: in expansion of macro 'EXPR_SIGNED'
 ((!_GL_SIGNED_TYPE_OR_EXPR (*(r)) && EXPR_SIGNED (a) && EXPR_SIGNED (b) \
 ^~~
src/pinky.c:115:11: note: in expansion of macro 'INT_MULTIPLY_WRAPV'
   if (INT_MULTIPLY_WRAPV (ulen, ampersands - 1, )
   ^~
./lib/intprops.h:588:8: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
   ((b) < 0 \
^
./lib/intprops.h:408:10: note: in expansion of macro '_GL_INT_MULTIPLY_RANGE_OVERFLOW'
   && _GL_INT_MULTIPLY_RANGE_OVERFLOW (a, b, 0, (__typeof__ (*(r))) -1)) \
  ^~~
src/pinky.c:115:11: note: in expansion of macro 'INT_MULTIPLY_WRAPV'
   if (INT_MULTIPLY_WRAPV (ulen, ampersands - 1, )
   ^~
./lib/intprops.h:589:11: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
? ((a) < 0 \
   ^
./lib/intprops.h:408:10: note: in expansion of macro '_GL_INT_MULTIPLY_RANGE_OVERFLOW'
   && _GL_INT_MULTIPLY_RANGE_OVERFLOW (a, b, 0, (__typeof__ (*(r))) -1)) \
  ^~~
src/pinky.c:115:11: note: in expansion of macro 'INT_MULTIPLY_WRAPV'
   if (INT_MULTIPLY_WRAPV (ulen, ampersands - 1, )
   ^~
./lib/intprops.h:44:55: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
 #define EXPR_SIGNED(e) (_GL_INT_NEGATE_CONVERT (e, 1) < 0)
   ^
./lib/intprops.h:590:10: note: in expansion of macro 'EXPR_SIGNED'
   ? (EXPR_SIGNED (_GL_INT_CONVERT (tmax, b)) \
  ^~~
./lib/intprops.h:408:10: note: in expansion of macro '_GL_INT_MULTIPLY_RANGE_OVERFLOW'
   && _GL_INT_MULTIPLY_RANGE_OVERFLOW (a, b, 0, (__typeof__ (*(r))) -1)) \
  ^~~
src/pinky.c:115:11: note: in expansion of macro 'INT_MULTIPLY_WRAPV'
   if (INT_MULTIPLY_WRAPV (ulen, ampersands - 1, )
   ^~
./lib/intprops.h:44:55: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
 #define EXPR_SIGNED(e) (_GL_INT_NEGATE_CONVERT (e, 1) < 0)
   ^
./lib/intprops.h:597:10: note: in expansion of macro 'EXPR_SIGNED'
   ? (EXPR_SIGNED (a) \
  ^~~
./lib/intprops.h:408:10: note: in expansion of macro '_GL_INT_MULTIPLY_RANGE_OVERFLOW'
   && _GL_INT_MULTIPLY_RANGE_OVERFLOW (a, b, 0, (__typeof__ (*(r))) -1)) \
  ^~~
src/pinky.c:115:11: note: in expansion of macro 'INT_MULTIPLY_WRAPV'
   if (INT_MULTIPLY_WRAPV (ulen, ampersands - 1, )
   ^~
./lib/intprops.h:603:11: error: comparison of unsigned expression < 0 is always false [-Werror=type-limits]
: ((a) < 0 \
   ^
./lib/intprops.h:408:10: note: in expansion of macro '_GL_INT_MULTIPLY_RANGE_OVERFLOW'
   && _GL_INT_MULTIPLY_RANGE_OVERFLOW (a, b, 0, (__typeof__ (*(r))) -1)) \

  

Re: cut -DF

2022-01-06 Thread Assaf Gordon

Hello,

On 2022-01-06 7:35 a.m., Pádraig Brady wrote:

Thanks for taking the time to consolidate options/functionality
across different implementations.  This is important for users.
Some notes below...

On 05/01/2022 16:23, Rob Landley wrote:

Around 5 years ago toybox added the -D, -F, and -O options to cut:

 -D  Don't sort/collate selections or match -fF lines without delimiter

 -F  Select fields separated by DELIM regex
 -O  Output delimiter (default one space for -F, input delim for -f)



As I see it, the main functionalities added here:
   - reordering of selected fields
   - adjusted suppression of lines without matching fields
   - regex delimiter support

I see regex support as less important, but still useful.
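
As an illustration of the "reordering of selected fields" point, here is
a minimal sketch of selecting fields in the order requested rather than
in input order. It is not the code from the attached patches: the
select_fields() name, the MAX_FIELDS limit and the fixed output buffer
are assumptions made to keep the example short, and it handles only a
single-byte delimiter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { MAX_FIELDS = 64 };   /* arbitrary limit for this sketch */

/* Copy into OUT the fields of LINE (split on DELIM) listed in
   WANT[0..N_WANT-1] (1-based), in the order listed and including
   duplicates, joined by OUT_DELIM.  Field numbers that do not exist
   in LINE are skipped.  Output that does not fit is truncated.
   Return the number of bytes written, excluding the trailing NUL.  */
static size_t
select_fields (const char *line, char delim,
               const size_t *want, size_t n_want,
               const char *out_delim, char *out, size_t out_size)
{
  const char *start[MAX_FIELDS];
  size_t len[MAX_FIELDS];
  size_t n_fields = 0;

  assert (delim != '\0' && out_size > 0);

  /* Pass 1: record where each field starts and how long it is.  */
  const char *p = line;
  while (n_fields < MAX_FIELDS)
    {
      const char *end = strchr (p, delim);
      start[n_fields] = p;
      len[n_fields] = end ? (size_t) (end - p) : strlen (p);
      n_fields++;
      if (!end)
        break;
      p = end + 1;
    }

  /* Pass 2: emit the fields in the order they were requested.  */
  size_t used = 0;
  size_t od_len = strlen (out_delim);
  bool first = true;
  for (size_t i = 0; i < n_want; i++)
    {
      size_t f = want[i];
      if (f == 0 || f > n_fields)
        continue;
      size_t need = len[f - 1] + (first ? 0 : od_len);
      if (used + need >= out_size)
        break;                  /* buffer full: truncate */
      if (!first)
        {
          memcpy (out + used, out_delim, od_len);
          used += od_len;
        }
      memcpy (out + used, start[f - 1], len[f - 1]);
      used += len[f - 1];
      first = false;
    }
  out[used] = '\0';
  return used;
}
```

Called with the line "a:b:c", delimiter ':' and the list 3,1,1, this
writes "c:a:a", which is the kind of result a sorted-and-merged field
list cannot express.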




Attached is a suggestion for initial implementation of "cut -FDO".
It's split into smaller steps to ease review.

The main issue is that the current "cut_fields" and "cut_bytes" are
highly optimized for speed, so I left them as-is and created a secondary
set of 'cut' functions - slower but with additional options.

If this is acceptable, I'll go on to clean up the patches, add more
tests and write documentation.

There are likely some edge-cases regarding regex matching that need to 
be decided upon (e.g. BRE or ERE, what about BOL/EOL anchors, groups, etc.).
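
To show why the BRE/ERE choice matters for a delimiter regex, here is a
small sketch built on the POSIX regcomp()/regexec() interface. The
count_regex_fields() helper is hypothetical and is not taken from the
patches; stopping at an empty match and passing REG_NOTBOL after the
first delimiter are policy choices made for this example only, and they
are exactly the kind of edge case that still needs a decision.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the fields of LINE when fields are separated by matches of the
   regular expression DELIM_RE.  USE_ERE selects POSIX extended syntax
   (REG_EXTENDED); otherwise basic syntax is used.
   Return 0 if the pattern does not compile.  */
static size_t
count_regex_fields (const char *line, const char *delim_re, bool use_ere)
{
  assert (line != NULL && delim_re != NULL);

  regex_t re;
  if (regcomp (&re, delim_re, use_ere ? REG_EXTENDED : 0) != 0)
    return 0;

  size_t n_fields = 1;
  const char *p = line;
  regmatch_t m;
  int eflags = 0;
  while (*p != '\0' && regexec (&re, p, 1, &m, eflags) == 0)
    {
      if (m.rm_so == m.rm_eo)
        break;                /* empty match: stop rather than loop forever */
      n_fields++;
      p += m.rm_eo;           /* continue after the delimiter */
      eflags = REG_NOTBOL;    /* P is no longer at the start of the line */
    }

  regfree (&re);
  return n_fields;
}
```

With the line "a  b   c", the ERE " +" (one or more spaces) gives three
fields. The same pattern compiled as a BRE looks for a space followed
by a literal plus sign, finds none, and gives a single field; the BRE
spelling of "one or more spaces" is "  *".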


Comments and feedback are very welcome,

regards,
 - assaf

>From dbfdef9a720c8ea9ed1a90a4e4c66aa7e0ed3e1f Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Wed, 5 Jan 2022 13:03:39 -0700
Subject: [PATCH 1/9] cut: set-fields: add no-sort options

---
 src/set-fields.c | 27 +++
 src/set-fields.h |  4 +++-
 2 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/src/set-fields.c b/src/set-fields.c
index e3cce30d9..5e4ee6715 100644
--- a/src/set-fields.c
+++ b/src/set-fields.c
@@ -279,22 +279,25 @@ set_fields (char const *fieldstr, unsigned int options)
          ? _("missing list of byte/character positions")
          : _("missing list of fields"));
 
-  qsort (frp, n_frp, sizeof (frp[0]), compare_ranges);
-
-  /* Merge range pairs (e.g. `2-5,3-4' becomes `2-5'). */
-  for (size_t i = 0; i < n_frp; ++i)
+  if (!(options & SETFLD_NO_SORT))
     {
-      for (size_t j = i + 1; j < n_frp; ++j)
+      qsort (frp, n_frp, sizeof (frp[0]), compare_ranges);
+
+      /* Merge range pairs (e.g. `2-5,3-4' becomes `2-5'). */
+      for (size_t i = 0; i < n_frp; ++i)
         {
-          if (frp[j].lo <= frp[i].hi)
+          for (size_t j = i + 1; j < n_frp; ++j)
             {
-              frp[i].hi = MAX (frp[j].hi, frp[i].hi);
-              memmove (frp + j, frp + j + 1, (n_frp - j - 1) * sizeof *frp);
-              n_frp--;
-              j--;
+              if (frp[j].lo <= frp[i].hi)
+                {
+                  frp[i].hi = MAX (frp[j].hi, frp[i].hi);
+                  memmove (frp + j, frp + j + 1, (n_frp - j - 1) * sizeof *frp);
+                  n_frp--;
+                  j--;
+                }
+              else
+                break;
             }
-          else
-            break;
         }
     }
 
diff --git a/src/set-fields.h b/src/set-fields.h
index 7bc9b3afe..9127d9957 100644
--- a/src/set-fields.h
+++ b/src/set-fields.h
@@ -34,8 +34,10 @@ enum
 {
   SETFLD_ALLOW_DASH = 0x01, /* allow single dash meaning 'all fields' */
   SETFLD_COMPLEMENT = 0x02, /* complement the field list */
-  SETFLD_ERRMSG_USE_POS = 0x04  /* when reporting errors, say 'position' instead
+  SETFLD_ERRMSG_USE_POS = 0x04, /* when reporting errors, say 'position' instead
of 'field' (used with cut -b/-c) */
+  SETFLD_NO_SORT= 0x08  /* Do not sort the fields; keep duplicated
+   and overlapped fields */
 };
 
 /* allocates and initializes the FRP array and N_FRP count */
-- 
2.20.1

>From d5d58eeb0bf5a399b2d65e174c72d0f8c11b2c01 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Wed, 5 Jan 2022 13:04:08 -0700
Subject: [PATCH 2/9] cut: initial -D implementation, currently only with "-f"

---
 src/cut.c | 161 --
 1 file changed, 156 insertions(+), 5 deletions(-)

diff --git a/src/cut.c b/src/cut.c
index 5143c8bd9..84caad091 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -20,7 +20,9 @@
 /* POSIX changes, bug fixes, long-named options, and cleanup
by David MacKenzie .
 
-   Rewrite cut_fields and cut_bytes -- Jim Meyering.  */
+   Rewrite cut_fields and cut_bytes -- Jim Meyering.
+
+   Match toybox's -D,-F,-O options -- Assaf Gordon. */
 
 #include <config.h>
 
@@ -43,7 +45,8 @@
 #define AUTHORS \
   proper_name ("David M. Ihnat"), \
   proper_name ("David MacKenzie"), \
-  proper_name ("Jim Meyering")
+  proper_name ("Jim Meyering"), \
+  proper_name ("Assaf Gordon")
 
 #define FATAL_ERROR(Message)		

Re: cut -DF

2022-01-05 Thread Assaf Gordon

Hello Rob and all,


On 2022-01-05 9:23 a.m., Rob Landley wrote:

Around 5 years ago toybox added the -D, -F, and -O options to cut:

 -D  Don't sort/collate selections or match -fF lines without delimiter
 -F  Select fields separated by DELIM regex
 -O  Output delimiter (default one space for -F, input delim for -f)

[...]


Elliott Hughes (the Android base OS maintainer) asked if I could get the feature
more widely adopted:


your non-POSIX cut(1) extension covers 80% of the in-the-wild use of awk
anyway :-) 

[...]


This is working and in use in Android, and now in busybox, and it would simplify
my regression test suite if coreutils was in sync, so I thought I'd ask if you
were interested.



I personally like the idea (at the very least "-D" will indeed replace
awk for many simple use-cases).

I'm working on a proof-of-concept (will share later today for feedback 
and comments).


Do you mind sharing your test suite?

-assaf





bug#49741: basenc --base64url decoding bug

2021-08-29 Thread Assaf Gordon

tag 49741 fixed
close 49741
stop

On 2021-08-22 4:15 p.m., Assaf Gordon wrote:

Attached a suggested fix.


pushed in:

https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=709d1f8253072804cc27189a6f2b873d8d563399






bug#50151: Coreutils, aarch64 and chroot

2021-08-25 Thread Assaf Gordon

tag 50151 notabug
close 50151
stop

On 2021-08-25 12:54 p.m., Frans de Boer wrote:

On 8/25/21 10:16 AM, Assaf Gordon wrote:

  qemu-aarch64 -strace -L /newroot \
  /newroot/usr/sbin/chroot /newroot /usr/bin/env --version 2>&1 \
  | tee log.txt

@assaf: your suggestions no. 1 and 2, had the predicted results. Thus, 
suggestion no. 3 failed because of suggestion no.2. I followed then 
suggestion 4 and attached the strace output to this message. It seems 
that chroot is working as expected, only env seems to fail with an error.


Not exactly:
The 'chroot' system-call *seems* to succeed,
followed by a failed "execve(2)" system call to execute another binary.
That "execve" system call fails - so it is not 'env' per-se,
it is any program that will try to execute another aarch64 binary.

Learning that, searching for "qemu-user", "chroot" and "architecture"
leads to several web pages detailing similar errors (and a few suggested
solutions):

https://wiki.gentoo.org/wiki/Crossdev_qemu-static-user-chroot

https://newbedev.com/how-can-i-chroot-into-a-filesystem-with-a-different-architechture

https://ownyourbits.com/2018/06/13/transparently-running-binaries-from-any-architecture-in-linux-with-qemu-and-binfmt_misc/


I hope you have some clue of what is going wrong.


With the above information, we can conclude this is not a bug
in coreutils - it is a limitation of the linux+qemu-user setup.

So I'm closing this item and marking it as "not a bug",
but discussion can continue by replying to this thread.

regards,
 - assaf








bug#50151: Coreutils, aarch64 and chroot

2021-08-25 Thread Assaf Gordon

Hello,

On 2021-08-24 2:39 a.m., Paul Eggert wrote:
However, I think it'll be a better use of our time for you to debug this 
one yourself. It doesn't sound like a Coreutils problem; it sounds like 
a problem in your virtual machine setup, and you're the best expert on 
that setup.


A few suggestions to check, that might help you and us to troubleshoot:

1. ensure the binaries are indeed for aarch64:

   file /newroot/usr/sbin/chroot
   file /newroot/usr/bin/env
   file /newroot/usr/bin/bash

it should say something like
  "ELF 64-bit LSB pie executable, ARM aarch64"
for all of them.


2. ensure each binary works by itself:

 qemu-aarch64 -L /newroot /newroot/usr/sbin/chroot --version
 qemu-aarch64 -L /newroot /newroot/usr/bin/env --version
 qemu-aarch64 -L /newroot /newroot/usr/bin/bash --version

(the actual version doesn't matter here, the main thing is that
the qemu user-mode emulator was able to run the binaries.)

On 2021-08-21 4:33 a.m., Frans de Boer wrote:


Running 'qemu-aarch64 -L /newroot /newroot/usr/bin/bash -c
/usr/bin/env --help' does show the env help text. So, I guess chroot
is to blame?

Note that the above command runs your *host's* /usr/bin/env
because chroot is not used - the binary under qemu
 (/newroot/usr/bin/bash) sees your host's file system.

Observe with:

  qemu-aarch64 -L /newroot /newroot/usr/bin/bash -c '/bin/uname -m'
  qemu-aarch64 -L /newroot /newroot/usr/bin/env /bin/uname -m

I'm guessing you will see "x86_64", not "aarch64".

3. What you should try is:

  qemu-aarch64 -L /newroot \
 /newroot/usr/bin/bash -c '/newroot/usr/bin/env --version'
and:
  qemu-aarch64 -L /newroot \
 /newroot/usr/bin/env /newroot/usr/bin/bash --version

In both cases, one aarch64 binary will try to execute another aarch64
binary. Do these work for you, or are you seeing an error?




4. Use qemu's "-strace" to see the syscalls, hopefully
that will help pinpoint the cause:

  qemu-aarch64 -strace -L /newroot \
  /newroot/usr/sbin/chroot /newroot /usr/bin/env --version 2>&1 \
  | tee log.txt

If the command results in an error, the "log.txt" file will show
more details about what failed.
If you're not familiar with 'strace' output, post it here as an email 
attachment.



Hope this helps,
 - assaf

P.S.

On 2021-08-24 2:39 a.m., Paul Eggert wrote:

A complete set of instructions for an outsider to reproduce the
problem from scratch.  Assume the outsider is running Fedora 34
x86-64 (since that's what I'm running :-).

I'm not familiar with Fedora, but on Debian/x86_64 the following works:

   apt-get install qemu-user
   apt-get install crossbuild-essential-arm64 libc6-arm64-cross

   cd coreutils
   ./configure --host=aarch64-linux-gnu
   make

then:

$ qemu-aarch64 -L /usr/aarch64-linux-gnu/ ./src/uname -m
aarch64

Somewhat related:

$ qemu-aarch64 -L /usr/aarch64-linux-gnu/ ./src/env ./src/uname -m
/lib/ld-linux-aarch64.so.1: No such file or directory

This fails because once "inside" qemu, the aarch64 binary searches for
"/lib/ld-linux-aarch64.so.1" but the file is in
"/usr/aarch64-linux-gnu/lib/ld-linux-aarch64.so.1".
One possible work-around is to build static binaries.

I don't want to assume that is the culprit for Frans, so we'll wait for 
the logs...








bug#49741: basenc --base64url decoding bug

2021-08-22 Thread Assaf Gordon

On 2021-08-17 3:37 a.m., Jim Meyering wrote:

On Tue, Aug 17, 2021 at 2:02 AM Pádraig Brady  wrote:

On 16/08/2021 22:17, Assaf Gordon wrote:


Attached a suggested fix.


minor nit in NEWS:

a nit in the commit log:


Thanks, attached updated patch.
Will push this week if there are no other comments.

-assaf



>From 090663068a23662b36ddc0603fc1c2c752b6aff1 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Mon, 16 Aug 2021 15:03:36 -0600
Subject: [PATCH] basenc: fix bug49741: using wrong decoding buffer length

Emil Lundberg  reports in
https://bugs.gnu.org/49741 about a 'basenc --base64 -d' decoding bug.
The input buffer length was not divisible by 3, resulting in
decoding errors.

* NEWS: Mention fix.
* src/basenc.c (DEC_BLOCKSIZE): Change from 1024*5 to 4200 (35*3*5*8)
which is divisible by 3,4,5,8 - satisfying both base32 and base64;
Use compile-time verify() macro to enforce the above.
* tests/misc/basenc.pl: Add test.
---
 NEWS | 4 
 src/basenc.c | 4 +++-
 tests/misc/basenc.pl | 9 +
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index ddec56bdf..efdb1450e 100644
--- a/NEWS
+++ b/NEWS
@@ -60,6 +60,10 @@ GNU coreutils NEWS-*- outline -*-
   invalid combinations of case character classes.
   [bug introduced in coreutils-8.6]
 
+  basenc --base64 --decode no longer silently discards decoded characters
+  on (1024*5) buffer boundaries
+  [bug introduced in coreutils-8.31]
+
 ** Changes in behavior
 
   cp and install now default to copy-on-write (COW) if available.
diff --git a/src/basenc.c b/src/basenc.c
index 5c97a3652..2ffdb2d27 100644
--- a/src/basenc.c
+++ b/src/basenc.c
@@ -213,7 +213,9 @@ verify (DEC_BLOCKSIZE % 12 == 0);  /* So complete encoded blocks are used.  */
 
 /* Note that increasing this may decrease performance if --ignore-garbage
is used, because of the memmove operation below.  */
-# define DEC_BLOCKSIZE (1024*5)
+# define DEC_BLOCKSIZE (4200)
+verify (DEC_BLOCKSIZE % 40 == 0); /* complete encoded blocks for base32 */
+verify (DEC_BLOCKSIZE % 12 == 0); /* complete encoded blocks for base64 */
 
 static int (*base_length) (int i);
 static bool (*isbase) (char ch);
diff --git a/tests/misc/basenc.pl b/tests/misc/basenc.pl
index 3383aaeef..ac5394731 100755
--- a/tests/misc/basenc.pl
+++ b/tests/misc/basenc.pl
@@ -37,6 +37,13 @@ my $base64url_out_nl = $base64url_out;
 $base64url_out_nl =~ s/(..)/\1\n/g; # add newline every two characters
 
 
+# Bug 49741:
+# The input  is 'abc' in base64, in an 8K buffer (larger than 1024*5,
+# the buffer size which caused the bug).
+my $base64_bug49741_in = "YWJj" x 2000 ;
+my $base64_bug49741_out = "abc" x 2000 ;
+
+
 my $base32_in = "\xfd\xd8\x07\xd1\xa5";
 my $base32_out = "7XMAPUNF";
 my $x = $base32_out;
@@ -111,6 +118,8 @@ my @Tests =
  ['b64u_7', '--base64url -d',  {IN=>$base64_out},
   {EXIT=>1},  {ERR=>"$prog: invalid input\n"}],
 
+ ['b64_bug49741', '--base64 -d',  {IN=>$base64_bug49741_in},
+  {OUT=>$base64_bug49741_out}],
 
 
 
-- 
2.20.1
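
The block-size reasoning in the commit message above can be checked
with a few lines of C. This is only a sketch: the enum names and the
leftover() helper are hypothetical, C11 static_assert stands in for
gnulib's verify(), and the unit sizes 12 and 40 are taken from the
verify() lines in the patch.

```c
#include <assert.h>
#include <stddef.h>

enum
{
  OLD_DEC_BLOCKSIZE = 1024 * 5,   /* value before the fix */
  NEW_DEC_BLOCKSIZE = 4200        /* 35*3*5*8, value after the fix */
};

/* Compile-time checks in the spirit of the verify() lines in the patch:
   a full read buffer must hold a whole number of encoded blocks.  */
static_assert (NEW_DEC_BLOCKSIZE % 40 == 0, "whole base32 blocks");
static_assert (NEW_DEC_BLOCKSIZE % 12 == 0, "whole base64 blocks");

/* Number of bytes left over at the end of a SIZE-byte buffer once it
   has been cut into UNIT-byte blocks.  */
static size_t
leftover (size_t size, size_t unit)
{
  assert (unit != 0);
  return size % unit;
}
```

Here leftover (OLD_DEC_BLOCKSIZE, 12) is 8, so a full 5120-byte buffer
did not end on a 12-byte boundary, whereas both leftover
(NEW_DEC_BLOCKSIZE, 12) and leftover (NEW_DEC_BLOCKSIZE, 40) are 0.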



bug#49741: basenc --base64url decoding bug

2021-08-16 Thread Assaf Gordon

Hello Emil and all,

Thanks for the clear and easily reproducible bug report.

Attached a suggested fix.
Comments are very welcome,

- Assaf

>From 11330058443e7cc92b4a53322d810725d42b4e34 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Mon, 16 Aug 2021 15:03:36 -0600
Subject: [PATCH] basenc: fix bug49741: using wrong decoding buffer length

Emil Lundberg  reports in
https://bugs.gnu.org/49741 about a 'basenc --base64 -d' decoding bug.
The input buffer was not divisible by 3, resulting in decoding errors.

* NEWS: Mention fix.
* src/basenc.c (DEC_BLOCKSIZE): Change from 1024*5 to 4200 (35*3*5*8)
which is divisible by 3,4,5,8 - satisfying both base32 and base64;
Use compile-time verify() macro to enforce the above.
* tests/misc/basenc.pl: Add test.
---
 NEWS | 4 
 src/basenc.c | 4 +++-
 tests/misc/basenc.pl | 9 +
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index ddec56bdf..d490ed101 100644
--- a/NEWS
+++ b/NEWS
@@ -60,6 +60,10 @@ GNU coreutils NEWS-*- outline -*-
   invalid combinations of case character classes.
   [bug introduced in coreutils-8.6]
 
+  basenc --base64 --decode no longer silently discard decoded characters
+  on (1024*5) buffer boundaries
+  [bug introduced in coreutils-8.31]
+
 ** Changes in behavior
 
   cp and install now default to copy-on-write (COW) if available.
diff --git a/src/basenc.c b/src/basenc.c
index 5c97a3652..2ffdb2d27 100644
--- a/src/basenc.c
+++ b/src/basenc.c
@@ -213,7 +213,9 @@ verify (DEC_BLOCKSIZE % 12 == 0);  /* So complete encoded blocks are used.  */
 
 /* Note that increasing this may decrease performance if --ignore-garbage
is used, because of the memmove operation below.  */
-# define DEC_BLOCKSIZE (1024*5)
+# define DEC_BLOCKSIZE (4200)
+verify (DEC_BLOCKSIZE % 40 == 0); /* complete encoded blocks for base32 */
+verify (DEC_BLOCKSIZE % 12 == 0); /* complete encoded blocks for base64 */
 
 static int (*base_length) (int i);
 static bool (*isbase) (char ch);
diff --git a/tests/misc/basenc.pl b/tests/misc/basenc.pl
index 3383aaeef..ac5394731 100755
--- a/tests/misc/basenc.pl
+++ b/tests/misc/basenc.pl
@@ -37,6 +37,13 @@ my $base64url_out_nl = $base64url_out;
 $base64url_out_nl =~ s/(..)/\1\n/g; # add newline every two characters
 
 
+# Bug 49741:
+# The input  is 'abc' in base64, in an 8K buffer (larger than 1024*5,
+# the buffer size which caused the bug).
+my $base64_bug49741_in = "YWJj" x 2000 ;
+my $base64_bug49741_out = "abc" x 2000 ;
+
+
 my $base32_in = "\xfd\xd8\x07\xd1\xa5";
 my $base32_out = "7XMAPUNF";
 my $x = $base32_out;
@@ -111,6 +118,8 @@ my @Tests =
  ['b64u_7', '--base64url -d',  {IN=>$base64_out},
   {EXIT=>1},  {ERR=>"$prog: invalid input\n"}],
 
+ ['b64_bug49741', '--base64 -d',  {IN=>$base64_bug49741_in},
+  {OUT=>$base64_bug49741_out}],
 
 
 
-- 
2.20.1



bug#49741: basenc --base64url decoding bug

2021-08-13 Thread Assaf Gordon

Hi,

I will also work on it this weekend.

 -assaf


On 2021-08-12 7:37 p.m., Paul Eggert wrote:
Simon, this looks like some sort of minor buffering problem in 'basenc 
--base64', since plain 'base64' works correctly. Is this something you 
have time to look into?


https://bugs.gnu.org/49741










Re: question

2021-04-29 Thread Assaf Gordon

Hello,

On 2021-04-29 12:34 p.m., steve.lowder.ctr--- via GNU coreutils General 
Discussion wrote:



Can you tell me what version of the GNU coreutils did the od command add the
-endian option?


Looking at the "NEWS" file
( https://git.savannah.gnu.org/cgit/coreutils.git/tree/NEWS#n1135 ),
"--endian" was added in version 8.23, released in July 2014
( https://git.savannah.gnu.org/cgit/coreutils.git/tree/NEWS#n1026 ).


regards,
 - assaf



Re: [PATCH] wc: Add AVX2 optimization when counting only lines

2021-04-21 Thread Assaf Gordon

Hello,

On 2021-03-29 7:21 a.m., Pádraig Brady wrote:

On 28/03/2021 18:29, Kristoffer Brånemyr via GNU coreutils General
Discussion wrote:
I wanted to practice some more using vector intrinsics, so I made a
small AVX2 optimization for wc -l. Depending on line length it is
about 2-5x faster than the previous version. (Well, only looking at user
time it is much faster than that even.)



Excellent results.
I'll review this very soon.



I'm attaching the patch (copied from the Github's pull-request),
hopefully we can continue the discussion here on the mailing list.
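
For anyone reviewing the patch, a portable reference for what the
vectorized path has to compute may be handy. The sketch below is an
assumption-laden baseline and not code from wc.c or from the patch: the
count_newlines() name is made up, and it simply leans on memchr() to
find each newline in a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the number of '\n' bytes in BUF[0..LEN-1].  Any faster
   implementation (AVX2 or otherwise) must return the same count for
   every input.  */
static uintmax_t
count_newlines (const char *buf, size_t len)
{
  assert (buf != NULL || len == 0);
  if (len == 0)
    return 0;

  uintmax_t lines = 0;
  const char *p = buf;
  const char *end = buf + len;

  /* Jump from newline to newline; memchr returns NULL when no more
     newlines remain, which ends the loop.  */
  while (p < end && (p = memchr (p, '\n', (size_t) (end - p))) != NULL)
    {
      lines++;
      p++;
    }
  return lines;
}
```

Comparing the output of such a scalar routine against the AVX2 routine
on buffers of varying length and alignment is one simple way to test
the new code path.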

-assaf
>From 462386ea5aad1b1673f7c1bc51983374aad325a8 Mon Sep 17 00:00:00 2001
From: Kristoffer Brånemyr 
Date: Sat, 20 Feb 2021 12:27:17 +0100
Subject: [PATCH] wc: Add AVX2 optimization when counting only lines

---
 configure.ac   |  46 ++
 po/POTFILES.in |   1 +
 src/local.mk   |   9 +++
 src/wc.c   | 162 -
 src/wc_avx2.c  | 115 +++
 5 files changed, 290 insertions(+), 43 deletions(-)
 create mode 100644 src/wc_avx2.c

diff --git a/configure.ac b/configure.ac
index 7fbecbf8d..8186b88f1 100644
--- a/configure.ac
+++ b/configure.ac
@@ -575,6 +575,52 @@ AM_CONDITIONAL([USE_PCLMUL_CRC32],
 test "x$pclmul_intrinsic_exists" = "xyes"])
 CFLAGS=$ac_save_CFLAGS
 
+AC_MSG_CHECKING([if __get_cpuid_count exists])
+AC_COMPILE_IFELSE(
+  [AC_LANG_SOURCE([[
+#include <cpuid.h>
+
+int main(void)
+{
+  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
+  __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
+  return 1;
+}
+  ]])
+  ],[
+AC_MSG_RESULT([yes])
+get_cpuid_count_exists=yes
+  ],[
+AC_MSG_RESULT([no])
+  ])
+
+CFLAGS="-mavx2 $CFLAGS"
+AC_MSG_CHECKING([if avx2 intrinstics exists])
+AC_COMPILE_IFELSE(
+  [AC_LANG_SOURCE([[
+#include 
+
+int main(void)
+{
+  __m256i a, b;
+  a = _mm256_sad_epu8(a, b);
+  return 1;
+}
+  ]])
+  ],[
+AC_MSG_RESULT([yes])
+AC_DEFINE([HAVE_AVX2_INTRINSIC], [1], [avx2 intrinsics exists])
+avx2_intrinsic_exists=yes
+  ],[
+AC_MSG_RESULT([no])
+  ])
+if test "x$get_cpuid_count_exists" = "xyes" && test "x$avx2_intrinsic_exists" = "xyes"; then
+  AC_DEFINE([USE_AVX2_WC_LINECOUNT], [1], [Counting lines with AVX2 enabled])
+fi
+AM_CONDITIONAL([USE_AVX2_WC_LINECOUNT], [test "x$get_cpuid_count_exists" = "xyes" && test "x$avx2_intrinsic_exists" = "xyes"])
+
+CFLAGS=$ac_save_CFLAGS
+
 
 
 dnl Autogenerated by the 'gen-lists-of-programs.sh' auxiliary script.
diff --git a/po/POTFILES.in b/po/POTFILES.in
index b5f5bbff1..dc80762db 100644
--- a/po/POTFILES.in
+++ b/po/POTFILES.in
@@ -142,6 +142,7 @@ src/unlink.c
 src/uptime.c
 src/users.c
 src/wc.c
+src/wc_avx2.c
 src/who.c
 src/whoami.c
 src/yes.c
diff --git a/src/local.mk b/src/local.mk
index 8c8479a53..c6555dafb 100644
--- a/src/local.mk
+++ b/src/local.mk
@@ -427,6 +427,15 @@ src_basenc_CPPFLAGS = -DBASE_TYPE=42 $(AM_CPPFLAGS)
 src_expand_SOURCES = src/expand.c src/expand-common.c
 src_unexpand_SOURCES = src/unexpand.c src/expand-common.c
 
+src_wc_SOURCES = src/wc.c
+if USE_AVX2_WC_LINECOUNT
+noinst_LIBRARIES += src/libwc_avx2.a
+src_libwc_avx2_a_SOURCES = src/wc_avx2.c
+wc_avx2_ldadd = src/libwc_avx2.a
+src_wc_LDADD += $(wc_avx2_ldadd)
+src_libwc_avx2_a_CFLAGS = -mavx2 $(AM_CFLAGS)
+endif
+
 # Ensure we don't link against libcoreutils.a as that lib is
 # not compiled with -fPIC which causes issues on 64 bit at least
 src_libstdbuf_so_LDADD = $(LIBINTL)
diff --git a/src/wc.c b/src/wc.c
index 5216db189..1ecec0d83 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -37,6 +37,9 @@
 #include "safe-read.h"
 #include "stat-size.h"
 #include "xbinary-io.h"
+#ifdef USE_AVX2_WC_LINECOUNT
+#include <cpuid.h>
+#endif
 
 #if !defined iswspace && !HAVE_ISWSPACE
 # define iswspace(wc) \
@@ -53,6 +56,15 @@
 /* Size of atomic reads. */
 #define BUFFER_SIZE (16 * 1024)
 
+static
+bool wc_lines(const char *file, int fd, uintmax_t *lines_out, uintmax_t *bytes_out);
+#ifdef USE_AVX2_WC_LINECOUNT
+/* From wc_avx2.c */
+bool wc_lines_avx2(const char *file, int fd, uintmax_t *lines_out, uintmax_t *bytes_out);
+#endif
+bool (*wc_lines_p)(const char *file, int fd, uintmax_t *lines_out, uintmax_t *bytes_out) = wc_lines;
+
+
 /* Cumulative number of lines, words, chars and bytes in all files so far.
max_line_length is the maximum over all files processed so far.  */
 static uintmax_t total_lines;
@@ -108,6 +120,41 @@ static struct option const longopts[] =
   {NULL, 0, NULL, 0}
 };
 
+#ifdef USE_AVX2_WC_LINECOUNT
+static bool
+avx2_supported(void)
+{
+  unsigned int eax = 0;
+  unsigned int ebx = 0;
+  unsigned int ecx = 0;
+  unsigned int edx = 0;
+
+  if (! __get_cpuid(1, &eax, &ebx, &ecx, &edx))
+{
+  return false;
+}
+
+  if (! (ecx & bit_OSXSAVE))
+{
+  return false;
+}
+
+  eax = ebx = ecx = edx = 0;
+
+  if (! __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
+{
+  return 

bug#44704: uniq: replace repeated lines with a message about how many repeated lines

2020-11-17 Thread Assaf Gordon

tag 44704 notabug
severity 44704 wishlist
stop

Hello,

On 2020-11-17 6:32 a.m., Brian J. Murrell wrote:

It would be a useful enhancement to uniq to replace all lines
considered non-uniq (i.e. those that would be removed from the output)
with a message about how many times the previous line was repeated.

I.e.

$ cat <
[...]

uniq supports the "--group" option, which adds a blank line after each
group of identical lines - this can be used down-stream to process
groups in any way you want.

Example:
  $ cat <<EOF > in
  first line
  second line
  repeated line
  repeated line
  repeated line
  repeated line
  repeated line
  third line
  EOF

  $ cat in | uniq --group=append
  first line

  second line

  repeated line
  repeated line
  repeated line
  repeated line
  repeated line

  third line


  $ cat in | uniq --group=append \
  | awk '$0=="" { print "do something after group" ; next } ;
 1 { print }'
  first line
  do something after group
  second line
  do something after group
  repeated line
  repeated line
  repeated line
  repeated line
  repeated line
  do something after group
  third line
  do something after group

And with counting:

$ cat in | uniq --group=append \
 | awk 'BEGIN { c = 0 } ;
$0=="" { print "Group has " c " lines" ; c=0 ; next } ;
1 { print ; c++ }'
  first line
  Group has 1 lines
  second line
  Group has 1 lines
  repeated line
  repeated line
  repeated line
  repeated line
  repeated line
  Group has 5 lines
  third line
  Group has 1 lines
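
The same pipeline can also print a message close to the one asked for in
the request. This is a minimal sketch, assuming GNU uniq (for "--group")
and input that contains no empty lines; the function name
"collapse_repeats" and the message wording are made up for illustration:

```shell
# Hypothetical helper: print the first line of each group and, for groups
# with more than one line, a note about how many repeats were dropped.
# Assumes GNU uniq and input without empty lines (an empty line would be
# mistaken for a group separator).
collapse_repeats() {
  uniq --group=append | awk '
    $0 == "" {                       # blank line = end of a group
      if (c > 1)
        printf "[previous line repeated %d more times]\n", c - 1
      c = 0
      next
    }
    c == 0 { print }                 # first line of a group
    { c++ }'
}

printf 'a\nb\nb\nb\nc\n' | collapse_repeats
```

For the sample input this prints "a", "b", the repeat note for the two
dropped copies of "b", and then "c".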


Hope this helps.
More information about "uniq --group=X" is here:

https://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html

I'm marking this as "notabug/wishlist", but will likely close it soon as
"wontfix" unless we come up with a convincing argument why "--group"
is not sufficient for your use case.

Regardless of the status, discussion can continue by replying to this 
thread.


regards,
 - assaf






Re: Enhancement Request: sort: skip table caption (or just a specified number of lines)

2020-11-05 Thread Assaf Gordon

Hello,

On 2020-11-05 10:23 a.m., Michael Mess wrote:

I have a feature request for the sort command:
I would like to sort a table but do not want to sort the column names a
the top. Thus the column names or a specified number of lines should be
just given out as they are, unsorted.


This has been discussed a few times in the past;
please see the discussion (with further links to other relevant 
postings) at:

  https://debbugs.gnu.org/cgi/bugreport.cgi?bug=22057




I know there is a workaround, but this is not so handy/comfortable
and has other disadvantages:

for a in ColumnName 4 2 3 1 ; do echo $a ;done | ( sed -u 1q ; sort
-n )



As for handy/comfortable, you can create a shell alias or a shell 
function, e.g.:


  $ alias sortcap="(sed -u 1q ; sort)"
  $ printf "%s\n" ColumnName 4 2 3 1 | sortcap
  ColumnName
  1
  2
  3
  4

But also (shameless plug), long ago I wrote a perl wrapper
for 'sort' that hides these details and accepts the same parameters as
'sort' with addition of "--header N" argument, similar to your requested 
"--caption".

https://github.com/agordon/bin_scripts/blob/master/scripts/sort-header.pl


Hope this helps.
 - Assaf






bug#43684: Problem with numerical splitting with files > 90*l

2020-09-29 Thread Assaf Gordon




On 29/09/2020 02:18, ned haughton wrote:

When splitting with -d, the numbering screws up after 89:


In addition to Pádraig's explanation, please see a previous similar 
discussion here:

  https://lists.gnu.org/archive/html/bug-coreutils/2017-02/msg00050.html
  http://bugs.gnu.org/25832

regards,
 - assaf





bug#42340: Fwd: bug#42340: "join" reports that "sort"ed input is not sorted

2020-07-15 Thread Assaf Gordon

Hello,

On 2020-07-15 2:12 p.m., Beth Andres-Beck wrote:

If that is the intended behavior, the bug is that:

printf '12,\n1,\n' | sort -t, -k1 -s

1,
12,

does _not_ take the remainder of the line into account, and only sorts on
the initial field, prioritizing length.

It is at the very least unexpected that adding an `a` to the end of both
lines would change the sort order of those lines:

printf '12,a\n1,a\n' | sort -t, -k1 -s

12,a
1,a



Not a bug, just an incomplete usage :)

sort's -k/--key parameter takes two values (the second being optional):
the first and last column to use as the key. If the second value is 
omitted (as in your case), then the key extends from the given start
field to the end of the line.

And so:
"sort -k1,1" means take the first *and only the first* field as the key.
"sort -k1" means take the first field until the end of the line as the key.
"sort -k1,3" means take the first, second and third fields as the single key.
"sort -k1,1 -k2,2 -k3,3" means take the first field as the first key,
second field as the second key, and third field as the third key.
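
As a small illustration of the multi-key form, here is a sketch with a
textual first key and a numeric second key. The helper name
"sort_two_keys" and the sample data are invented for this example, and
the C locale is forced so the result does not depend on the system
locale:

```shell
# Hypothetical helper: sort comma-separated lines by field 1 (bytewise),
# then by field 2 as a number.
sort_two_keys() {
  LC_ALL=C sort -t, -k1,1 -k2,2n
}

printf 'b,2\na,10\na,9\n' | sort_two_keys
# a,9
# a,10
# b,2
```

With a plain "-k1" instead, the key would run to the end of the line and
"a,10" would sort before "a,9", because '1' is smaller than '9' in a
bytewise comparison.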

---

The "--debug" option can help illustrate what sort is doing,
by adding underscore characters to show which characters are being used 
as keys in each line. Consider the following:


   $ printf '12,\n1,\n' | sort -t, -k1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,
   __
   12,
   ___

   $ printf '12,\n1,\n' | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,
   _
   12,
   __

In the first example, "-k1" means from the first field to the end of the
line, so the underscores include the "," character.
In the second example, "-k1,1" means only the first field, and the 
comma is not used.


Now consider your second case of adding an "a" at the end of each line:

   $ printf '12,a\n1,a\n' | sort -t, -k1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   12,a
   ____
   1,a
   ___

   $ printf '12,a\n1,a\n' | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   12,a
   __

In the first example, "-k1" means: from first field until the end of the 
line, and so the entire string "12,a" is compared against "1,a".


**AND**, because the locale is a "utf-8" locale, punctuation characters 
are ignored (as mentioned in the previous email in this thread).

So effectively the compared strings are "12a" vs "1a".
The ASCII value of "2" is smaller than the ASCII value of "a", and
therefore "12a" appears before "1a".

If we force C locale, then the order is reversed:

   $ printf '12,a\n1,a\n' | LC_ALL=C sort -t, -k1 -s --debug
   sort: using simple byte comparison
   1,a
   ___
   12,a
   ____

Because now punctuation characters are used, and the ASCII value of ","
is smaller than the ASCII value of "2".

**HOWEVER**, this result of using "LC_ALL=C" together with "-k1" is
only correct by a happy accident :)
it is still very likely that "-k1" is not what you wanted - you 
probably meant to do "-k1,1".


---

Lastly, the "-s/--stable" option in the above contrived examples is 
superfluous - it doesn't affect the output order because there are no
equal field values (i.e. "1" vs "12").
A slightly better example to illustrate how "-s" affects ordering is this:

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1
   1,a
   2,b
   2,x

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 -s
   1,a
   2,x
   2,b

Here, "1" comes before "2" - that's obvious. But should "2,b" come 
before "2,x" ?
If we do not use "-s/--stable", then "sort" ALSO does one additional 
comparison of the entire line as a last step (hence "sort --help" says
"[disable] last-resort comparison" about "-s/--stable").
The substring ",b" comes before ",x" - therefore "2,b" appears first.

If we add "-s/--stable", the last comparison step of the entire line is 
skipped, and the lines of "2" appear in the order they were in the input 
(hence - "stable").


By using "--debug" we can see the additional comparison step (indicated 
by additional underscore lines):


   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   ___
   2,b
   _
   ___
   2,x
   _
   ___


   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   2,x
   _
   2,b
   _

---

Hope this helps.
regards,
 - assaf







bug#42340: "join" reports that "sort"ed input is not sorted

2020-07-13 Thread Assaf Gordon

tags 42340 notabug
close 42340
stop

Hello,

On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:

In trying to use `join` with `sort` I discovered odd behavior: even after
running a file through `sort` using the same delimiter, `join` would still
complain that it was out of order.

[...]

Here is a way to reproduce the problem:


printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
join -t, a.txt b.txt

  join: b.txt:2: is not sorted: 1.1.1,b

The expected behavior would be that if a file has been sorted by "sort" it
will also be considered sorted by join.

[...]
I traced this back to what I believe to be a bug in sort.c 


This is not a bug in sort or join, just a side-effect of the locale on 
your system on the sorting results.


By forcing a C locale with "LC_ALL=C" (meaning simple ASCII order),
the files are ordered in the same way 'join' expected them to be:

 $ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
 $ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
 $ join -t, a.txt b.txt
 1.1.1,2,b
 1.1.12,2,a

---

More details:
I'm going to assume your system uses some locale based on UTF-8.
You can check it by running 'locale', e.g. on my system:
  $ locale
  LANG=en_CA.utf8
  LANGUAGE=en_CA:en
  LC_CTYPE="en_CA.utf8"
  ..
  ..

Under most UTF-8 locales, punctuation characters are *ignored* in the
compared input lines. This might be confusing and non-intuitive, but
that's the way most systems have been working for many years (locale
ordering is defined in the GNU C Library, and coreutils has no way to
change it).

Observe the following:

  $ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
  12,a
  1,b

  $ printf '12,a\n1,b\n' | LC_ALL=C sort
  1,b
  12,a

With a UTF-8 locale, the comma character is ignored, and then "12a" 
appears before "1b" (since the character '2' comes before the character
'b').

With "C" locale, forcing ASCII or "byte comparison", punctuation 
characters are not ignored, and "1,b" appears before "12,a" (because
the ASCII value of the comma ',' is 44, which is smaller than 50, the
ASCII value of the digit '2').


---

Somewhat related:
Your sort command defines the delimiter ("-t,") but does not define 
which columns to sort by; sort then uses the entire input line - and 
there's no need to specify delimiter at all.
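
Putting both points together, here is a sketch that sorts each input on
the join field only and runs sort and join in the same (C) locale. The
helper name "sorted_join" is made up for illustration, and it assumes
comma-separated files joined on their first field:

```shell
# Hypothetical helper: sort two comma-separated files on field 1 in the
# C locale, then join them in the same locale so both tools agree on
# the ordering.
sorted_join() {
  tmp=$(mktemp -d) || return 1
  LC_ALL=C sort -t, -k1,1 "$1" > "$tmp/a"
  LC_ALL=C sort -t, -k1,1 "$2" > "$tmp/b"
  LC_ALL=C join -t, "$tmp/a" "$tmp/b"
  rm -rf "$tmp"
}
```

Called with the a.txt and b.txt from the report, this should give the
same two joined lines as the LC_ALL=C example above.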


---

As such, I'm closing this as "not a bug", but discussion can continue by
replying to this thread.

regards,
 - assaf






Re: [PATCH] maint: add sm3sum based on OSCCA SM3 secure hash

2020-06-09 Thread Assaf Gordon

Hello,

On 2020-06-09 12:23 a.m., Tianjia Zhang wrote:

Add message digest program sm3sum, it use OSCCA SM3 secure
hash (OSCCA GM/T 0004-2012 SM3) generic hash transformation.



There has already been a discussion about adding SM3 to coreutils three
years ago, and it was decided against adding it:

https://lists.gnu.org/archive/html/coreutils/2017-10/msg00043.html


regards,
 - assaf



Re: [PATCH] md5sum: add an option to change directory

2020-05-30 Thread Assaf Gordon

Hello,

On 2020-05-30 3:59 p.m., Bertrand Jacquin wrote:
[...]

This definitely make sense


   $ sha256sum -C /etc fstab
   b5d6c0e5e6bc419b134478ad7b3e7c8cc628049876a7772cea469e81e4b0e0e5  fstab


The net effect is that just the output has changed to omit the path
name.

Maybe this wants to be a --strip or -p option like with diff or patch,
or --basename-only to strip a variable number of components, leaving
only
the last.


This seems to be a better approach indeed. I just sent a new patch using
base_name from coreutils itself.


The GNU Datamash program can do basename and dirname on a column of a 
text file, producing the wanted results (and more):


 $ md5sum /etc/fstab world.txt | datamash -W --full basename 2 dirname 2
 b50f98cdf2d6e26a99040ad5386b0884  /etc/fstab  fstab  /etc
 b1946ac92492d2347c6235b4d2611184  world.txt   world.txt  .

And this will work on any input without the need to duplicate
functionality in multiple programs.



-assaf



Re: Extend uniq to support unsorted list based on hashtable

2020-05-29 Thread Assaf Gordon

Hello,

On 2020-05-29 10:16 p.m., Yair Lenga wrote:

Wanted to suggest that the team will look (again) at implementing
--unsorted option for 'uniq'.

The idea was proposed (and rejected) about 10 years ago
(https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html).
Lot of things have changed from the past.


[...]


Can you advise/provide feedback. I'm sure that there will be many
volunteers (me included) to contribute to such important improvement.


"uniq" is standardized by POSIX to work by "comparing adjacent lines"
(from: 
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html ) - 
hence the requirement to pre-sort the input.


While it could be extended with a completely different hash-based
implementation, I don't think this is likely to happen.

As an alternative (and a shameless plug), allow me to point to
GNU Datamash ( https://www.gnu.org/software/datamash/ ).
On one hand, it already has a hash-based implementation to
remove duplicated fields (called "rmdup").
Consider the following contrived example:

  $ (printf "%s\t%s\n" 9 B 3 A ; seq 10 | paste - -) | datamash rmdup 1
  9 B
  3 A
  1 2
  5 6
  7 8

And on the other hand, because 'datamash' is non-standard,
there's less of a problem in adding new functionality (i.e. "bloat" is
not as big a concern as it is for coreutils).
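
For whole-line de-duplication of unsorted input there is also a widely
used awk idiom that needs no extra package. The sketch below wraps it in
a function; the name "uniq_unsorted" is made up for illustration, and the
approach keeps every distinct line in memory:

```shell
# Hypothetical helper: print each distinct line once, in order of first
# appearance, using an awk associative array as the hash table.
uniq_unsorted() {
  awk '!seen[$0]++'
}

printf 'b\na\nb\nc\na\n' | uniq_unsorted
# b
# a
# c
```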

Hope this helps.

regards,
 - assaf





Re: [PATCH] md5sum: add an option to change directory

2020-05-20 Thread Assaf Gordon

Hello,

On 2020-05-20 3:15 p.m., Bertrand Jacquin wrote:

In the fashion of make and git, add the ability for all sum tools to
change directory before reading a file.

[...]

   $ sha256sum -C /etc fstab
   b5d6c0e5e6bc419b134478ad7b3e7c8cc628049876a7772cea469e81e4b0e0e5  fstab


I'm not entirely sure what the use case is,
but GNU "env(1)" already has a '-C/--chdir' option which does
exactly what you want (since version 8.28, released in 2017):

   env -C /etc sha256sum fstab

Or the (longer) shell construct:

   (cd /etc && sha256sum fstab)



regards,
 - assaf



Re: suggestion: /etc/dd.conf

2020-04-29 Thread Assaf Gordon

Hello,

On 2020-04-28 3:14 a.m., turgut kalfaoğlu wrote:
I would like to suggest and in fact volunteer to create a conf file 
option to 'dd'.


Adding to others' replies,
similar suggestions for Coreutils configuration files have been
discussed in the past, and rejected:
  https://www.gnu.org/software/coreutils/rejected_requests.html#gnuconf


regards,
 - assaf



Re: decorate - new sorting-helper program (experimental)

2020-04-24 Thread Assaf Gordon

Hello,

Just a quick note that the "decorate" program (explained below)
was just released as part of GNU datamash 1.7:

   https://lists.gnu.org/archive/html/info-gnu/2020-04/msg00011.html

Comments, suggestions and feedback are very welcome.



On 2020-04-13 1:14 p.m., Assaf Gordon wrote:

Hello,

I'm happy to announce the first experimental release of the "decorate"
program.

'decorate' works in tandem with coreutils' sort(1) to allow new sorting
methods (e.g. IP addresses, roman numerals, string lengths).

This is a new program but an old idea, suggested by Pádraig here:
  https://lists.gnu.org/r/bug-coreutils/2015-06/msg00076.html

---

The program is part of the "datamash" package, and available here:
   https://alpha.gnu.org/gnu/datamash/datamash-1.5.17-735b.tar.gz

"./configure && make" should give you the "decorate" executable.

The rest of this (long) email shows usage information and examples.

This is an experimental version, and everything could still change.

Comments, suggestions and feedback are *very* welcome.

regards,
  - assaf




 General Usage

The general idea is:
1. convert a field of an input file to a format  that can be easily
    sorted by sort(1), e.g., converting roman numerals
    to their decimal equivalent or IPv4 addresses to 32 bit hex value.
2. Pass this converted (=decorated) input to sort
3. remove (=undecorate) the converted fields.

Example 1:

   ### convert roman-numerals, add new field
   $ printf "%s\n" C V III IX XI | ./decorate -k1,1:roman --decorate
   00100 C
   00005 V
   00003 III
   00009 IX
   00011 XI

   ### combine decorate-sort-undecorate
   $ printf "%s\n" C V III IX XI \
    | ./decorate -k1,1:roman --decorate \
    | sort -k1,1  \
    | ./decorate --undecorate 1
   III
   V
   IX
   XI
   C


 Easy/automatic 'decorate-sort-undecorate' method 

Since the decorate-sort-undecorate pattern is repetitive,
the "decorate" program can execute 'decorate + sort + undecorate' 
automatically (forking + piping to sort and back).


This is done when "--decorate" and "--undecorate" arguments are *not* 
specified (i.e. - decorate is used as a 'sort' wrapper):


   $ printf "%s\n" C V III IX XI | ./decorate -k1,1:roman
   III
   V
   IX
   XI
   C


 Conversions Syntax

The -k/--key specification follows sort(1), with the addition
of allowing a conversion function name following a ":" (colon).

Examples:

   $ printf "MMXX III\n" | ./decorate --decorate -k1,1:roman
   02020 MMXX III
   $ printf "MMXX III\n" | ./decorate --decorate -k1.2,1:roman
   01020 MMXX III
   $ printf "MMXX III\n" | ./decorate --decorate -k1,1:strlen
   4 MMXX III
   $ printf "MMXX III\n" | ./decorate --decorate -k1:strlen
   8 MMXX III

The "r" (=reverse) flag can also be used:

   $ printf "%s\n" X I IV IX VI | ./decorate -k1,1:roman
   I
   IV
   VI
   IX
   X

   $ printf "%s\n" X I IV IX VI | ./decorate -k1,1r:roman
   X
   IX
   VI
   IV
   I

Available conversions methods:
   as-is     copy as-is
   roman     roman numerals
   strlen    length (in bytes) of the specified field
   ipv4      dotted-decimal IPv4 addresses
   ipv6      IPv6 addresses
   ipv4inet  number-and-dots IPv4 addresses (incl. octal, hex values)

Examples:

   $ printf "%s\n" 10.2.3.4  8.9.7.3 | ./decorate --decorate -k1,1:ipv4
   0A020304 10.2.3.4
   08090703 8.9.7.3

   $ printf "%s\n" 10.010.0x10.10 192.168 \
     | ./decorate --decorate   -k1,1:ipv4inet
   0A08100A 10.010.0x10.10
   C00000A8 192.168

   $ printf "%s\n" :: 2000::1234 ::ffff:192.168.1.42 \
   | ./decorate --decorate -k1,1:ipv6
   0000:0000:0000:0000:0000:0000:0000:0000 ::
   2000:0000:0000:0000:0000:0000:0000:1234 2000::1234
   0000:0000:0000:0000:0000:FFFF:C0A8:012A ::ffff:192.168.1.42


 Mixing -k/--key for decorating and sorting 

When 'decorate' automatically runs sort(1), any keys
that are not used for decoration are passed to 'sort'
(after being adjusted for the right column).

Example:

   $ printf "%-2s %d\n" C 4 IC 1 I 107 II 4 C 31 I 19 \
   | ./decorate -k1,1:roman -k2nr,2
   I  107
   I  19
   II 4
   IC 1
   C  31
   C  4


   $ printf "%-2s %d\n" C 4 IC 1 I 107 II 4 C 31 I 19 \
     | ./decorate -k2n,2 -k1,1:roman
   IC 1
   II 4
   C  4
   I  19
   C  31
   I  107

To better understand what parameters are passed to sort(1),
use "--print-sort-args" (which only prints the arguments to be used
with sort(1) but does not decorate or sort the input):

Here, "decorate" knows that a new field 

Re: decorate - new sorting-helper program (experimental)

2020-04-14 Thread Assaf Gordon

Hello Bernhard,

Thanks for the feedback and thanks for trying it (or trying to try it :) ).

On 2020-04-14 12:51 a.m., Bernhard Voelker wrote:

On 2020-04-13 21:14, Assaf Gordon wrote:

I'm happy to announce the first experimental release of the "decorate"
program.



The program is part of the "datamash" package, and available here:
https://alpha.gnu.org/gnu/datamash/datamash-1.5.17-735b.tar.gz


I'm a bit confused.
I've just pulled from 'git://git.sv.gnu.org/datamash.git', but the
decorate sources are not there yet, but instead, there's a 'v1.6' tag
which doesn't fit into above's "1.5.17-735b" versioning.
Do you push somewhere else?



Sorry about that, it's in such a preliminary state that I didn't
want to push it yet (certainly not to the master branch).

I added now a new "decorate" branch, which contains the (still messy)
code. It was branched off a version prior to 1.6, hence the
version issue.

To test it please try:
   git clone -b decorate git://git.sv.gnu.org/datamash.git

Once it stabilizes I will of course clean it, squash it, and push it to
the "master" branch.

regards,
 - assaf





decorate - new sorting-helper program (experimental)

2020-04-13 Thread Assaf Gordon

Hello,

I'm happy to announce the first experimental release of the "decorate"
program.

'decorate' works in tandem with coreutils' sort(1) to allow new sorting
methods (e.g. IP addresses, roman numerals, string lengths).

This is a new program but an old idea, suggested by Pádraig here:
 https://lists.gnu.org/r/bug-coreutils/2015-06/msg00076.html

---

The program is part of the "datamash" package, and available here:
  https://alpha.gnu.org/gnu/datamash/datamash-1.5.17-735b.tar.gz

"./configure && make" should give you the "decorate" executable.

The rest of this (long) email shows usage information and examples.

This is an experimental version, and everything could still change.

Comments, suggestions and feedback are *very* welcome.

regards,
 - assaf




 General Usage

The general idea is:
1. convert a field of an input file to a format  that can be easily
   sorted by sort(1), e.g., converting roman numerals
   to their decimal equivalent or IPv4 addresses to 32 bit hex value.
2. Pass this converted (=decorated) input to sort
3. remove (=undecorate) the converted fields.

Example 1:

  ### convert roman-numerals, add new field
  $ printf "%s\n" C V III IX XI | ./decorate -k1,1:roman --decorate
  00100 C
  00005 V
  00003 III
  00009 IX
  00011 XI

  ### combine decorate-sort-undecorate
  $ printf "%s\n" C V III IX XI \
   | ./decorate -k1,1:roman --decorate \
   | sort -k1,1  \
   | ./decorate --undecorate 1
  III
  V
  IX
  XI
  C
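
For comparison, the same three steps can be written by hand with
standard tools. This is only a sketch of the pattern, not of how
"decorate" is implemented; the helper name "sort_by_length" is invented
here, and it decorates with the line length rather than a roman-numeral
value:

```shell
# Hypothetical helper showing decorate-sort-undecorate by hand:
#  1. decorate:   prefix each line with its length and a TAB
#  2. sort:       numeric sort on the added first field (stable for ties)
#  3. undecorate: drop the added field again
sort_by_length() {
  awk '{ printf "%d\t%s\n", length($0), $0 }' \
    | sort -n -k1,1 -s \
    | cut -f2-
}

printf 'ccc\na\nbb\n' | sort_by_length
# a
# bb
# ccc
```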


 Easy/automatic 'decorate-sort-undecorate' method 

Since the decorate-sort-undecorate pattern is repetitive,
the "decorate" program can execute 'decorate + sort + undecorate' 
automatically (forking + piping to sort and back).


This is done when "--decorate" and "--undecorate" arguments are *not* 
specified (i.e. - decorate is used as a 'sort' wrapper):


  $ printf "%s\n" C V III IX XI | ./decorate -k1,1:roman
  III
  V
  IX
  XI
  C


 Conversions Syntax

The -k/--key specification follows sort(1), with the addition
of allowing a conversion function name following a ":" (colon).

Examples:

  $ printf "MMXX III\n" | ./decorate --decorate -k1,1:roman
  02020 MMXX III
  $ printf "MMXX III\n" | ./decorate --decorate -k1.2,1:roman
  01020 MMXX III
  $ printf "MMXX III\n" | ./decorate --decorate -k1,1:strlen
  4 MMXX III
  $ printf "MMXX III\n" | ./decorate --decorate -k1:strlen
  8 MMXX III

The "r" (=reverse) flag can also be used:

  $ printf "%s\n" X I IV IX VI | ./decorate -k1,1:roman
  I
  IV
  VI
  IX
  X

  $ printf "%s\n" X I IV IX VI | ./decorate -k1,1r:roman
  X
  IX
  VI
  IV
  I

Available conversions methods:
  as-is     copy as-is
  roman     roman numerals
  strlen    length (in bytes) of the specified field
  ipv4      dotted-decimal IPv4 addresses
  ipv6      IPv6 addresses
  ipv4inet  number-and-dots IPv4 addresses (incl. octal, hex values)

Examples:

  $ printf "%s\n" 10.2.3.4  8.9.7.3 | ./decorate --decorate -k1,1:ipv4
  0A020304 10.2.3.4
  08090703 8.9.7.3
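
To make the ipv4 conversion concrete, here is a rough awk equivalent of
the decoration step. It is a sketch under assumptions, not the code used
by "decorate": the helper names are invented, the input is assumed to be
one valid dotted-decimal address per line, and no validation is done:

```shell
# Hypothetical helper: prefix each dotted-decimal IPv4 address with its
# 8-digit hexadecimal form, which sorts correctly as plain text.
ipv4_decorate() {
  awk -F. '{ printf "%02X%02X%02X%02X %s\n", $1, $2, $3, $4, $0 }'
}

# Hypothetical helper: decorate, sort bytewise on the hex field, then
# remove the hex field again.
sort_ipv4() {
  ipv4_decorate | LC_ALL=C sort -k1,1 | cut -d' ' -f2-
}

printf '%s\n' 10.2.3.4 8.9.7.3 | ipv4_decorate
# 0A020304 10.2.3.4
# 08090703 8.9.7.3
```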

  $ printf "%s\n" 10.010.0x10.10 192.168 \
| ./decorate --decorate   -k1,1:ipv4inet
  0A08100A 10.010.0x10.10
  C00000A8 192.168

  $ printf "%s\n" :: 2000::1234 ::ffff:192.168.1.42 \
  | ./decorate --decorate -k1,1:ipv6
  0000:0000:0000:0000:0000:0000:0000:0000 ::
  2000:0000:0000:0000:0000:0000:0000:1234 2000::1234
  0000:0000:0000:0000:0000:FFFF:C0A8:012A ::ffff:192.168.1.42


 Mixing -k/--key for decorating and sorting 

When 'decorate' automatically runs sort(1), any keys
that are not used for decoration are passed to 'sort'
(after being adjusted for the right column).

Example:

  $ printf "%-2s %d\n" C 4 IC 1 I 107 II 4 C 31 I 19 \
  | ./decorate -k1,1:roman -k2nr,2
  I  107
  I  19
  II 4
  IC 1
  C  31
  C  4


  $ printf "%-2s %d\n" C 4 IC 1 I 107 II 4 C 31 I 19 \
| ./decorate -k2n,2 -k1,1:roman
  IC 1
  II 4
  C  4
  I  19
  C  31
  I  107

To better understand what parameters are passed to sort(1),
use "--print-sort-args" (which only prints the arguments to be used
with sort(1) but does not decorate or sort the input):

Here, "decorate" knows that a new field will be added
(the converted roman numerals), and so the "-k2nr,2"
is adjusted to be "-k3,3nr":

  $ ./decorate --print-sort-args -k1,1:roman -k2nr,2
  sort -k1,1 -k3,3nr

Here, "decorate" will add two fields (first ipv4 from field 2,
and roman numerals from field 3). The "-k5,5V" is adjusted
to be "-k7,7V":

  $ ./decorate --print-sort-args -k5,5V -k2,2:ipv4 -k3,3:roman
  sort -k7,7V -k1,1 -k2,2


 Other sort(1) parameters 

When 'decorate' automatically runs sort(1), several common sort(1)
options are accepted and passed as-is to sort.

Example:

$ ./decorate --print-sort-args -k2,2:ipv4 \
 --stable \
   

bug#40530: feature proposal: coreutils -> sort: adding sorting ability for Hebrew numerals

2020-04-09 Thread Assaf Gordon
Hello,

> On Apr 9, 2020, at 3:23 PM, Zeev Pekar  wrote:
> 
> it would be nice to be able to sort (coreutils -> sort) Hebrew numerals:

An interesting idea, but I think it is a bit too niche to be included in the 
coreutils “sort” program (tradeoff of usefulness vs bloat).

However, such functionality is very suitable to an old idea of an auxiliary 
“decorate” program that will allow many more sorting options when used in 
tandem with “sort”.

I’ve started writing such a program some time ago, based on Pádraig's idea 
(never completed, but perhaps these days are a perfect opportunity to complete 
it):
https://lists.gnu.org/archive/html/coreutils/2019-03/msg00056.html

Would you like to try your hand at coding the sorting rules for such 
Hebrew-numerals sort?

regards,
 - Assaf



Re: altchars for base64

2020-03-15 Thread Assaf Gordon

Hello,

On 2020-03-15 12:12 a.m., Kaz Kylheku (Coreutils) wrote:

On 2020-03-14 22:20, Peng Yu wrote:

Python base64 decoder has the altchars option.

[...]

But I don't see such an option in coreutils' base64. Can this option
be added? Thanks.


# use %* instead of +/:
base64 whatever | tr '+/' '%*'


The reason for alternative characters is typically to use them in URLs,
where "/" and "+" are problematic.

A new command "basenc" was introduced in coreutils version 8.31
(released last year) which supports multiple encodings.
One of these is a "web-safe" variant of base64, as defined in
RFC4648 section 5:

  $ printf '\376\117\202' | basenc --base64
  /k+C

  $ printf '\376\117\202' | basenc --base64url
  _k-C
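
On systems whose coreutils predates "basenc", the same alphabet swap can
be approximated with the "tr" approach shown earlier in the thread. A
minimal sketch (the function name is invented; padding "=" characters
and line wrapping are left exactly as "base64" emits them):

```shell
# Hypothetical helper: base64-encode stdin, then map the two problematic
# characters to their RFC 4648 section 5 replacements ('/' -> '_',
# '+' -> '-'). The sets are written as '/+' and '_-' so that neither tr
# operand starts with '-' and gets mistaken for an option.
base64url_encode() {
  base64 | tr '/+' '_-'
}

printf '\376\117\202' | base64url_encode
# _k-C
```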

regards,
 - assaf

P.S.
The other supported encodings are (basenc --help):

  --base64  same as 'base64' program (RFC4648 section 4)
  --base64url   file- and url-safe base64 (RFC4648 section 5)
  --base32  same as 'base32' program (RFC4648 section 6)
  --base32hex   extended hex alphabet base32 (RFC4648 section 7)
  --base16  hex encoding (RFC4648 section 8)
  --base2msbf   bit string with most significant bit (msb) first
  --base2lsbf   bit string with least significant bit (lsb) first







Re: Suggestion: Keep headings when sorted

2020-01-24 Thread Assaf Gordon

Hello,

On 2020-01-21 2:14 a.m., Mattias Johansson wrote:

I often find that I want to keep one or a few lines untouched by sort, and end 
up using something like this:

$ awk 'NR == 1; NR > 1 { print $0 | "sort" }'

It would be handy if sort had an option for 'number of heading lines' or 
similar!

I imagine something like this:

$ sort -H  # keeps first line in place while sorting the rest


Adding "skip-header" support to GNU sort has been requested and
discussed several times in the past (including by me, seven years
ago...).

The decision was that such functionality can be easily achieved
using existing tools.

For some more details and past discussions, please see:
https://www.gnu.org/software/coreutils/rejected_requests.html#sort
https://lists.gnu.org/archive/html/coreutils/2013-01/msg00027.html
https://lists.gnu.org/archive/html/coreutils/2014-11/msg00022.html
https://lists.gnu.org/archive/html/coreutils/2015-10/msg00102.html

---

The simplest method is:

$ ANY-PROGRAM | ( sed -u 1q ; sort )

This is slightly simpler and shorter than the above "awk" method.
It requires GNU sed for the "-u/--unbuffered" option.


The above sed+sort invocation can be made into a shell function:

  sorth() { sed -u 1q ; sort "$@"; }

And then use "sorth" instead of "sort"
(noting that the main difference is that "sort" can take input files on
the command line, while "sorth" must take the input from STDIN).

---

Change "1q" to "3q" or other values to keep more than one line of
headers at the top of the input.

The above shell function can be improved into:

   sorth() { num=$1 ; shift ; sed -u ${num}q ; sort "$@"; }

To accept the number of header lines as a (required) first parameter,
e.g. the following will keep the first three values intact and randomize 
the remaining 7:


   seq 10 | sorth 3 -R
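
Where GNU sed's "-u" is not available, the header lines can be consumed
with the shell's own "read" instead. This is a sketch, assuming a POSIX
shell; the name "sorth_portable" is made up for illustration:

```shell
# Hypothetical variant of sorth without GNU sed: the shell's "read"
# consumes exactly one line at a time, so the rest of STDIN is left
# untouched for sort. First parameter: number of header lines.
sorth_portable() {
  n=$1
  shift
  while [ "$n" -gt 0 ] && IFS= read -r line; do
    printf '%s\n' "$line"
    n=$((n - 1))
  done
  sort "$@"
}

printf '%s\n' ColumnName 4 2 3 1 | sorth_portable 1 -n
```

As with "sorth", the input must come from STDIN; any remaining arguments
are passed to sort unchanged.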

---

If all else fails, and such a sort-header program is still needed,
I can offer my own attempt at such a perl-wrapper script,
which I wrote before knowing about the "sed/sort" method:
 https://github.com/agordon/bin_scripts/blob/master/scripts/sort-header.pl

Hope this helps,
regards,
  Assaf






Re: base64 utilty Question

2020-01-03 Thread Assaf Gordon

Hello,

On 2020-01-03 1:00 p.m., Bahubali Y wrote:

I have question about base64. If I have "LF" as line terminator will that me converted to 
"CRLF" in base64 encoding ?.


Generally no.
GNU base64 preserves the input exactly.

Example:

  $ printf "hello\n" | base64 | base64 -d | od -tx1c -An
68  65  6c  6c  6f  0a
 h   e   l   l   o  \n



I observed above case in my usage


Perhaps another part of your processing converts
the new line characters? esp. if you are using Windows.

If you can provide a succinct reproducible example,
that would help in diagnosing the issue.
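
One quick way to check where a CR might be coming from is to compare the
encodings of the two line endings directly. A small sketch (the variable
names are arbitrary):

```shell
# Encode the same text with LF and with CRLF line endings.
lf=$(printf 'hello\n' | base64)
crlf=$(printf 'hello\r\n' | base64)

echo "LF:   $lf"      # LF:   aGVsbG8K
echo "CRLF: $crlf"    # CRLF: aGVsbG8NCg==
```

If the encoded data you receive looks like the second form, the CR was
already present in the input before it reached base64.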

regards,
 - assaf







Re: [PATCH] ls: support --time=creation to show/sort birth time

2020-01-02 Thread Assaf Gordon

Hello,

On 2020-01-02 2:01 p.m., Pádraig Brady wrote:

On 02/01/2020 20:29, Assaf Gordon wrote:

Regarding "fall back to mtime", I'm seeing the following results
on some systems - not necessarily a bug, but perhaps it's worth
knowing what to expect:

* Debian 10/x86_64, Linux Kernel 4.19.0, glibc 2.28-10,
with ext2 file system (not supporting birthtime):

    $ ./src/ls -l --time=birth /tmp/dummy-ext2/2
    -rw-r--r-- 1 root root 0 Dec 31  1969 /tmp/dummy-ext2/2


Hmm. That suggests that STATX_BTIME is set in the
returned statx mask, but populated with 0 in the structure
({-1,-1} would have printed as '?'). Though you say src/stat
prints '-' in all cases, and the logic should be much the same.
Could you confirm the birth time significant returns for this case.
epoch isn't a bad time to output in this case, but it would
be good to be consistent.



Indeed, the returned "btime" is zero:


$ strace -v -e trace=statx ./src/stat /tmp/dummy-ext2/2
statx(AT_FDCWD, "/tmp/dummy-ext2/2", 
AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 
{stx_mask=STATX_BASIC_STATS, stx_blksize=1024, stx_attributes=0, 
stx_nlink=1, stx_uid=0, stx_gid=0, stx_mode=S_IFREG|0644, stx_ino=12, 
stx_size=0, stx_blocks=0, 
stx_attributes_mask=STATX_ATTR_COMPRESSED|STATX_ATTR_IMMUTABLE|STATX_ATTR_APPEND|STATX_ATTR_NODUMP|STATX_ATTR_ENCRYPTED, 
stx_atime={tv_sec=1577995860, tv_nsec=0} /* 2020-01-02T13:11:00-0700 */, 
stx_btime={tv_sec=0, tv_nsec=0}, stx_ctime={tv_sec=1577995860, 
tv_nsec=0} /* 2020-01-02T13:11:00-0700 */, stx_mtime={tv_sec=1577995860, 
tv_nsec=0} /* 2020-01-02T13:11:00-0700 */, stx_rdev_major=0, 
stx_rdev_minor=0, stx_dev_major=7, stx_dev_minor=0}) = 0

  File: /tmp/dummy-ext2/2
  Size: 0   Blocks: 0  IO Block: 1024   regular empty file
Device: 700h/1792d  Inode: 12  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2020-01-02 13:11:00.000000000 -0700
Modify: 2020-01-02 13:11:00.000000000 -0700
Change: 2020-01-02 13:11:00.000000000 -0700
 Birth: -
+++ exited with 0 +++

$ strace -v -e trace=statx ./src/ls -l --time=birth /tmp/dummy-ext2/2
statx(AT_FDCWD, "/tmp/dummy-ext2/2", 
AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, 
STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|STATX_SIZE|STATX_BTIME, 
{stx_mask=STATX_BASIC_STATS, stx_blksize=1024, stx_attributes=0, 
stx_nlink=1, stx_uid=0, stx_gid=0, stx_mode=S_IFREG|0644, stx_ino=12, 
stx_size=0, stx_blocks=0, 
stx_attributes_mask=STATX_ATTR_COMPRESSED|STATX_ATTR_IMMUTABLE|STATX_ATTR_APPEND|STATX_ATTR_NODUMP|STATX_ATTR_ENCRYPTED, 
stx_atime={tv_sec=1577995860, tv_nsec=0} /* 2020-01-02T13:11:00-0700 */, 
stx_btime={tv_sec=0, tv_nsec=0}, stx_ctime={tv_sec=1577995860, 
tv_nsec=0} /* 2020-01-02T13:11:00-0700 */, stx_mtime={tv_sec=1577995860, 
tv_nsec=0} /* 2020-01-02T13:11:00-0700 */, stx_rdev_major=0, 
stx_rdev_minor=0, stx_dev_major=7, stx_dev_minor=0}) = 0

-rw-r--r-- 1 root root 0 Dec 31  1969 /tmp/dummy-ext2/2
+++ exited with 0 +++


Looking closer at the new ls.c code (** are added for emphasis):
---
do_statx (int fd, const char *name, struct stat *st, int flags,
          unsigned int mask)
{
  struct statx stx;
**bool want_btime = mask & STATX_BTIME;
  int ret = statx (fd, name, flags, mask, &stx);
  if (ret >= 0)
    {
      statx_to_stat (&stx, st);
      /* Since we only need one timestamp type,
         store birth time in st_mtim.  */
**    if (mask & STATX_BTIME)
        st->st_mtim = statx_timestamp_to_timespec (stx.stx_btime);
**    else if (want_btime)
        st->st_mtim.tv_sec = st->st_mtim.tv_nsec = -1;
    }

  return ret;
}
---

Wouldn't "mask & STATX_BTIME" always be the same as "want_btime",
resulting in the "else if" part never to be executed?

IIUC, "mask" is the requested bitmask.

Comparing with "stat.c:do_stat()", I see:

...
statx_to_stat (&stx, &st);
if (stx.stx_mask & STATX_BTIME)
   pa.btime = statx_timestamp_to_timespec (stx.stx_btime);

...

Which I reckon is the returned bitmask (vs. the requested bitmask).

Could that be the issue ?
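
To make the requested-vs-returned distinction concrete, here is a small model of the two checks. It is only a sketch in Python, not the ls.c or stat.c code: the two function names are made up, and the mask constants are the values from <linux/stat.h>.

```python
STATX_BASIC_STATS = 0x000007FF  # value from <linux/stat.h>
STATX_BTIME = 0x00000800        # value from <linux/stat.h>


def btime_checking_requested_mask(requested, returned, btime):
    """Model of the quoted ls.c logic: it tests the *requested* mask."""
    want_btime = bool(requested & STATX_BTIME)
    if requested & STATX_BTIME:
        return btime            # taken whenever btime was asked for
    elif want_btime:
        return (-1, -1)         # never reached: same condition as above
    return None


def btime_checking_returned_mask(requested, returned, btime):
    """Model of the stat.c-style logic: it tests the *returned* mask."""
    if not requested & STATX_BTIME:
        return None
    if returned & STATX_BTIME:
        return btime
    return (-1, -1)             # kernel did not fill in a birth time


if __name__ == "__main__":
    # As in the strace output: btime requested, only basic stats returned,
    # and stx_btime left at {0, 0}.
    requested = STATX_BASIC_STATS | STATX_BTIME
    returned = STATX_BASIC_STATS
    print(btime_checking_requested_mask(requested, returned, (0, 0)))
    print(btime_checking_returned_mask(requested, returned, (0, 0)))
```

With the values from the strace output, the first variant hands back the zeroed timestamp (which prints as the epoch), while the second reports the "unknown" marker.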


-assaf










Re: [PATCH] ls: support --time=creation to show/sort birth time

2020-01-02 Thread Assaf Gordon

Hello Pádraig and all,


On 2020-01-02 10:48 a.m., Pádraig Brady wrote:

+  ls now supports the --time=birth option to display and sort by
+  file creation time, where available.


+1

Patch looks good, builds and passes the test on Debian 10/x86_64,
OpenBSD 6.6, FreeBSD 12.1, Alpine Linux, and Cygwin-10/64bit on 
Windows7/NTFS.


A suggestion:


  static char const *const time_args[] =
  {
-  "atime", "access", "use", "ctime", "status", NULL
+  "atime", "access", "use",
+  "ctime", "status",
+  "birth", "creation",
+  NULL
  };
  static enum time_type const time_types[] =
  {
-  time_atime, time_atime, time_atime, time_ctime, time_ctime
+  time_atime, time_atime, time_atime,
+  time_ctime, time_ctime,
+  time_btime, time_btime,
  };


Perhaps add "btime" and "crtime" as aliases to birth time?
"btime" is for completion with atime/ctime.
"crtime" is used/mentioned in some contexts (e.g. in "debugfs").


+/* Return the platform birthtime member of the stat structure,
+   or fallback to the mtime member, which we have populated
+   from the statx structure where supported.  */


Regarding "fall back to mtime", I'm seeing the following results
on some systems - not necessarily a bug, but perhaps it's worth
knowing what to expect:

* Debian 10/x86_64, Linux Kernel 4.19.0, glibc 2.28-10,
with ext2 file system (not supporting birthtime):

  $ ./src/ls -l --time=birth /tmp/dummy-ext2/2
  -rw-r--r-- 1 root root 0 Dec 31  1969 /tmp/dummy-ext2/2

(I guess this is unix-epoch adjusted for my local time zone)
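
That guess is easy to check. A minimal sketch in Python, assuming a UTC-7 offset (the offset shown in the strace timestamps elsewhere in this thread); the helper name is made up:

```python
from datetime import datetime, timedelta, timezone


def epoch_in_zone(utc_offset_hours):
    """Return timestamp 0 (the Unix epoch) as seen at a fixed UTC offset."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(0, zone)


if __name__ == "__main__":
    # At UTC-7 the epoch falls on the evening of the previous calendar day.
    print(epoch_in_zone(-7).strftime("%b %d  %Y %H:%M"))
```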


* Alpine Linux, Kernel 4.19.80, musl-libc 1.1.22:

$ ./src/ls -l --time=birth README
-rw-r--r-- 1 miles miles 10778? README


* OpenBSD 6.6 on "ffs" type file system:

$ ./src/ls -l --time=birth README
   -rw-r--r-- 1 miles miles 10778? README


On all the above systems, running "./src/stat" correctly
shows "birth: -" .

regards,
 - assaf




Re: Decimal time support in 'date'

2019-12-13 Thread Assaf Gordon
Hello,


On Thu, Dec 12, 2019 at 6:57 PM za3k--- via GNU coreutils General
Discussion  wrote:
>
> I am interested in adding support for decimal time to 'date', but before
> I dive into writing a patch, I wanted to ask whether the patch has a
> chance of being accepted--this may just be too obscure.

Thank you for the suggestion and for checking in first - that's an
excellent approach.

> In decimal time, 2019-12-12.75 would represent 2019-12-12T18:00:00.
> Decimal time in the modern era is mainly used in timekeeping (to track
> employee or contracting hours) and in scientific recording (to make
> drawing graphs easy). Astronomers use another form of decimal time on
> their own calendar and would not be supported.

This is an interesting idea, certainly worth discussing.
When such format is used by time-keepers or scientific recording, is
it being used
on the command-line or from a shell script? or is this more commonly
done in a higher-level programming language?
Can you expand on the other format used by Astronomers?
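
For readers unfamiliar with the format, the conversion described in the quoted message can be sketched as below. This is only an illustration in Python, not a proposal for the 'date' syntax; the function name and the accepted input form are assumptions:

```python
from datetime import datetime, timedelta


def parse_decimal_date(text):
    """Convert 'YYYY-MM-DD.fraction' into a datetime.

    The digits after the dot are read as a fraction of a 24-hour day.
    """
    day_part, _, fraction_part = text.partition(".")
    day = datetime.strptime(day_part, "%Y-%m-%d")
    fraction = float("0." + fraction_part) if fraction_part else 0.0
    return day + timedelta(days=fraction)


if __name__ == "__main__":
    print(parse_decimal_date("2019-12-12.75").isoformat())
```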

---

Before going further, please be aware that in order for such patch to
be accepted (or even evaluated), we'll need a copyright assignment
from you (and, potentially, from your employer or university, if you
implement it as part of work/school project).

To learn more, please see here:
  https://www.gnu.org/licenses/why-assign.en.html
To start the process, please fill the following form and send it to
ass...@gnu.org :
  
https://git.savannah.gnu.org/cgit/gnulib.git/tree/doc/Copyright/request-assign.future
(for the program/package question, please fill both "coreutils" and "gnulib")
---

On the technical side,
I expect such a patch to modify mainly gnulib's nstrftime.c module:
https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/nstrftime.c
If we consider adding a new letter operator (e.g. "%X" ) we should
make sure it does not conflict with any existing letters, including on
non-GNU implementations (e.g. on BSDs).

regards,
  - assaf



bug#38003: date --date=-1month gives same month today

2019-10-31 Thread Assaf Gordon

tag 38003 notabug
close 38003
stop

Hello,

On 2019-10-31 2:34 a.m., Ilja Honkonen wrote:
Please CC me as I'm not on this list. Running date (GNU coreutils) 8.26 
on fedora 30 today (date --utc  -I: 2019-10-31) with --date=-1month 
gives the same month which doesn't make sense:

$ date --utc -I --date=-1month
2019-10-01


date gained a "--debug" option that helps diagnose such issues:

$ date --utc -I --debug --date=-1month
date: parsed relative part: -1 month(s)
[...]
date: using current date as starting value: '(Y-M-D) 2019-10-31'
[...]
date: warning: when adding relative months/years, it is recommended to 
specify the 15th of the months   <

date: after date adjustment (+0 years, -1 months, +0 days),
date: new date/time = '(Y-M-D) 2019-10-01 17:29:20'
date: warning: month/year adjustment resulted in shifted dates:
date:  adjusted Y M D: 2019 09 31<
date:normalized Y M D: 2019 10 01<
[...]
date: final: (Y-M-D) 2019-10-01 17:29:20 (UTC)
2019-10-01

--

Subtracting 1 month from October 31st results in September 31st.
Since the date doesn't exist, it is normalized:
September 31st is "one day after September 30th", which
results in October 1st.

The "--debug" option also warns: when subtracting months,
it is recommended to specify the 15th (middle) of the month,
exactly to avoid such issues.

   $ date --utc -I --date="2019-10-15 -1month"
   2019-09-15
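
The adjust-then-normalize behaviour can be modelled in a few lines. This is a rough sketch in Python of the arithmetic described above, not the actual date parsing code; the function name is made up:

```python
from datetime import date, timedelta


def add_months_normalized(start, months):
    """Shift the month field, keep the day number, then normalize.

    Day N of a month is treated as (N - 1) days after the 1st, so a day
    number past the end of the month spills into the following month.
    """
    month_index = start.year * 12 + (start.month - 1) + months
    year, month_zero_based = divmod(month_index, 12)
    first_of_month = date(year, month_zero_based + 1, 1)
    return first_of_month + timedelta(days=start.day - 1)


if __name__ == "__main__":
    print(add_months_normalized(date(2019, 10, 31), -1))  # "September 31st"
    print(add_months_normalized(date(2019, 10, 15), -1))  # mid-month is safe
```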

regards,
 - assaf






Re: How to implement the V comparsion used by sort in python?

2019-10-28 Thread Assaf Gordon

Hello,

On 2019-10-28 3:00 a.m., Florian Weimer wrote:

* Assaf Gordon:

On Oct 26, 2019, at 5:05 PM, Peng Yu  wrote:

Are you sure they are 100% compatible with V? I don’t want to use
them just later find they are not 100% compatible.


There are no such guarantees, especially not with free software.


I don't know why you say that.


Perhaps my writing wasn't clear enough.

What I meant was: *I* can not provide any such guarantees
(since the question was "are *you* sure").

I can't speak for other coreutils maintainers (or the people who
wrote the gnulib version-compare module), but I highly suspect
that they will also not be willing to guarantee such 100% compatibility.

As for the "free software" part - (almost?) every free software license
explicitly mentions that the software comes with no warranty whatsoever.
Typically the license includes the phrase "[no] FITNESS FOR A
PARTICULAR PURPOSE" - meaning that even if there is some implied purpose
(such as sorting 'naturally' for "sort -V"), there is no guarantee
it is even fit for that purpose.

In practice, it means that even if I (or others) took a cursory look
at both "sort -V" and the mentioned python package and deemed them
"compatible", there are still *no* guarantees they are actually 100%
compatible. There could always be a bug or an unexpected result.

It seemed to me the OP wanted some very strong guarantees regarding
that code that would save him time and effort, without investing time
or other resources to do the testing themselves.
To that, my answer was "no such guarantees".

If my previous answer was too brief, I hope this clarifies it.


But someone certainly has to do this work.


I completely agree.

If the OP wants reasonable assurance they are compatible,
they can read the details about "sort -V" and invest the time
and effort in comparing it to the python package algorithm.

Or for stronger guarantees - perhaps they can consider hiring someone
to do a very thorough investigation and provide them with some
concrete guarantees.

regards,
 - assaf



Re: How to implement the V comparsion used by sort in python?

2019-10-27 Thread Assaf Gordon
Hello,

> On Oct 26, 2019, at 5:05 PM, Peng Yu  wrote:
> 
> Are you sure they are 100% compatible with V? I don’t want to use them just 
> later find they are not 100% compatible.

There are no such guarantees, especially not with free software.

The details I previously sent to you ( 
https://lists.gnu.org/archive/html/coreutils/2019-10/msg2.html ) explain 
any differences between “sort -V” and debian’s dpkg/apt algorithm, which is 
what the mentioned python package implements.

You’ll have to do some work yourself to determine whether these differences 
affect your desired outcome.

regards,
  - Assaf





Re: How to implement the V comparsion used by sort in python?

2019-10-26 Thread Assaf Gordon
Hello,


> On Oct 25, 2019, at 8:00 PM, Peng Yu  wrote:
> 
> 
> I'd like to mimic the V sort order in python. Is there any easy to use
> comparison available in python?

A simple online search will show several python packages that can do it.
For example:
https://deb-pkg-tools.readthedocs.io/en/latest/api.html#module-deb_pkg_tools.version

-assaf 

Re: Does head util cause SIGPIPE?

2019-10-25 Thread Assaf Gordon

Hello,

The question "does head cause SIGPIPE" is seemingly simple,
and the answer is "yes" - but there are some nuances that might
cause unexpected results.

More specifically,
1. The "head" process terminates when all requested
   lines have been printed (e.g. one line with "head -n1").
2. The STDIN of the 'head' process is closed, which corresponds
   to the STDOUT of the preceding process ('find' in your case).
3. *IF* the 'find' process tries to write again to its STDOUT
   (which is now a closed pipe), then a SIGPIPE will be raised
   and 'find' will terminate.

On 2019-10-25 1:56 a.m., Ray Satiro wrote:

Recently I tracked down a delay in some scripts to this line:

find / -name filename* 2>/dev/null | head -n 1

[...]
owner@ubuntu1604-x64-vm:~$ ( trap '' pipe; find / -name initrd* 
2>/dev/null | strace -e 'trace=!all' head -n 1)

/initrd.img
+++ exited with 0 +++
(few seconds wait)


In your case,
I can guess that there is only a single file matching your predicate 
'initrd*'.

The 'head' indeed terminates, and the pipe is closed.
But if 'find' doesn't find any more matching files, it doesn't
try to print anything more, and SIGPIPE is never raised.

Note the manual page of pipe(7) says:
"If all file descriptors referring to  the  read
 end  of a pipe have been closed, then a write(2)
 will cause a SIGPIPE signal to be generated for the
 calling process."

So if no further files were found, 'find' continues (slowly scanning
the disk) until it finishes.
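
The pipe(7) behaviour quoted above can be observed with a small sketch in Python. One caveat: the Python interpreter ignores SIGPIPE at startup, so the failed write shows up as a BrokenPipeError (EPIPE) instead of killing the process; a program like 'find' with default signal handling would be terminated by the signal. The function name is made up:

```python
import os


def write_after_reader_closed():
    """Write to a pipe whose read end has already been closed."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)                 # the "head" side goes away
    try:
        os.write(write_fd, b"another match\n")
    except BrokenPipeError:
        # SIGPIPE is ignored by the interpreter, so EPIPE is reported.
        return "EPIPE"
    finally:
        os.close(write_fd)
    return "written"


if __name__ == "__main__":
    print(write_after_reader_closed())
```

The important part is that nothing happens until the writer actually calls write(2) again, which is why a 'find' with no further matches keeps scanning.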

Since we only need the first line I can just use find options -print 
-quit and skip piping to head. But say we needed the first n results, 
how would I do that with head and get find to terminate rather than 
continue searching?


That's an interesting question, but perhaps better answered
in bug-findut...@gnu.org (although findutils maintainers are also
on this mailing list).

---

There could be other instances where the sending process won't receive
SIGPIPE: if the entire output is very small (less than 4096 bytes on
Linux, and at least 512 bytes on POSIX systems).

For example, this 'seq' won't be terminated by a signal,
as the entire output is just 21 bytes:

$ seq 10 | wc -c
21

$ seq 10 | head -n1

But this 'seq' will be terminated by a signal:

   $ seq 10000 | wc -c
   48894

   $ seq 10000 | head -n1
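
The small-output case can be sketched as well. In this Python illustration (the function name is made up) the writer puts the equivalent of "seq 10" into a pipe and finishes before the reader goes away, so no write ever fails:

```python
import os


def small_write_then_reader_exits(payload):
    """Write a small payload, finish, and only then close the read end."""
    read_fd, write_fd = os.pipe()
    written = os.write(write_fd, payload)  # fits in the kernel pipe buffer
    os.close(write_fd)                     # the writer is already done
    os.close(read_fd)                      # the reader exits without reading
    return written


if __name__ == "__main__":
    seq_10 = "".join(f"{n}\n" for n in range(1, 11)).encode()
    print(len(seq_10), small_write_then_reader_exits(seq_10))
```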

---

GNU 'time' can be used to quickly see how a process terminated (with a 
signal, or a non-zero exit code). It will print a line such as:

   "Command terminated by signal 13"
(signal 13 is SIGPIPE on linux).

   $ \time -p -f "" seq 10000 | head -n1
   1
   Command terminated by signal 13

   $ \time -p -f "" seq 10 | head -n1
   1

And just a couple of days ago a new experimental feature
was added to GNU time to allow finer printf-style output
about signals and exit codes:
  https://lists.gnu.org/archive/html/bug-time/2019-10/msg2.html

---

Lastly,
recent versions of GNU 'env' (since coreutils version 8.31, released in
March 2019) have new command-line options to ignore, block, and restore
to default any signal, as a useful alternative to "trap ''",
see:

https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=95adadd9a420812ddd3f0fc6105f668922a97ae5
and the manual:
  https://www.gnu.org/software/coreutils/env

---

Hope this helps,
 - assaf







bug#37702: Suggestion for 'df' utility

2019-10-13 Thread Assaf Gordon

Hello Bernhard,

On 2019-10-13 3:57 p.m., Bernhard Voelker wrote:

On 2019-10-13 23:28, Paul Eggert wrote:

In any sane system there would be only
four lines of non-header output (for tmpfs etc, /, /home, and
/media/eggert/B827-D456), but df is outputting 28 lines.


What is so special about tmpfs so that you would like to see it?


As an interesting use-case (though not common),
I recently configured a raspberry PI device,
and wanted to mount as many locations on tmpfs as possible,
e.g. "/tmp" "/var/tmp", "/var/log" etc.

It was very useful in those cases to be able to see the separate
tmpfs file systems listed, with information about how big they
are and how much space was used.

Also in other systems where "/tmp" is a "tmpfs",
users might want to see how much space is available.

If we hide it by default, they can of course use "df /tmp"
or "df --all" - it's not about removing this option,
it is just about making users' life harder or easier,
and making unexpected changes.


I recently also encountered a change in a default behavior of
a program which I've been using a very long time - and it is *very*
frustrating to have something that worked "just fine" for so long
being changed.



Here on my openSUSE:Tumbleweed system, I see the following:

   $ df -T
   Filesystem Type 1K-blocks  Used Available Use% Mounted on

[...]

   /dev/loop0 ext2 31729 31729 0 100% /FULL_PARTITION_TMPDIR

[...]


(The /FULL_PARTITION_TMPDIR is used by a special coreutils test.)



That's an interesting case, where I would think you'd want to see it,
because you explicitly mounted it.



I think I could well live with adding 'devtmpfs' and 'tmpfs' to the
pseudo file systems in gnulib's "mountlist.c".


I agree, but think this needs to be communicated very well,
and in advance - perhaps announce this change ahead of time to
the respective package maintainers of each distribution - just so
they'll know it's coming (and also have a way to revert it if they don't
like it).



This seems to be a small change, and not satisfying the snap case.


Possibly hiding "squashfs" of readonly-mounts could get rid of those snaps?

regards,
  -assaf






bug#37702: Suggestion for 'df' utility

2019-10-13 Thread Assaf Gordon

On 2019-10-13 3:28 p.m., Paul Eggert wrote:
[..]
I mean c'mon, here's the output of 'df' on the Ubuntu 18.04.3 LTS 
workstation I'm typing this particular message on. In any sane system 
there would be only four lines of non-header output (for tmpfs etc, /, 
/home, and /media/eggert/B827-D456), but df is outputting 28 lines. This 
is ridiculous.




It is certainly inconvenient if that's not what you are looking for
(and certainly most desktop users aren't).

But I'm not sure if it's easy to find a set of criteria
that would work well while having minimal unexpected side effects of 
hiding entries people in other systems do expect to see.


Out of curiosity,
can you share the output of the following commands on the same system?

lsblk

df -x tmpfs -x devtmpfs -x squashfs


Thanks,
 - assaf





bug#37702: Suggestion for 'df' utility

2019-10-13 Thread Assaf Gordon

Hi all,

On 2019-10-13 2:27 p.m., Paul Eggert wrote:

On 10/13/19 2:41 AM, Pádraig Brady wrote:

I wonder could we key (also) on used==0||available==0.


Yes, looking at the sample output I gave earlier, I'd say we could by 
default drop filesystems where usage is 1% or less. That would solve the 
problem for my workstation. This is roughly akin to the "used==0" test 
you're suggesting.




I would humbly suggest caution with such unexpected user-facing changes
to the default output of 'df' - learning the lessons from changing the 
quotes in 'ls'.


Countless users have been using 'df' in their own ways, and have gotten
used to certain outputs.

This thread originated from a request to "clean up" the output on newer
ubuntu machines which use "snap" packages as /dev/loopN .

Let's not turn that into a drastic change that will affect many other
existing systems - the users on other systems did not ask for any changes.

---

Specifically for "default drop filesystems where usage is 1% or less" -
I can think of a few cases off the top of my head where this would be
extremely confusing:

- I recently installed a 33TB raid file system. The usage on that system
is at 1% and will stay like so for at least several days.

- Amazon cloud services (AWS) offers an NFS4 service (they call it
"EFS") that has a reported size of 8 exabytes. There too usage could be at
1% for a long long time.
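
To illustrate the concern, here is a small sketch in Python of what such a threshold would hide. It does not reflect any df code: the function name, the file system names and the "used" figures are made up, and rounding the percentage up is an assumption of the sketch.

```python
def hidden_by_threshold(filesystems, max_use_percent=1):
    """Return the names that a 'usage <= N%' rule would drop."""
    hidden = []
    for name, size_bytes, used_bytes in filesystems:
        # Integer ceiling division: the percentage is rounded up.
        use_percent = -(-used_bytes * 100 // size_bytes)
        if use_percent <= max_use_percent:
            hidden.append(name)
    return hidden


if __name__ == "__main__":
    example = [
        ("raid", 33 * 10**12, 300 * 10**9),  # 33TB array, 300GB used
        ("root", 50 * 10**9, 20 * 10**9),    # ordinary root file system
        ("efs", 2**63, 10**15),              # 8 exabytes, 1PB used
    ]
    print(hidden_by_threshold(example))
```

In this example the two largest file systems are the ones that disappear, even though they hold hundreds of gigabytes or more.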


---


For cases where I want to list only the "real" storage, I typically use
an alias such as:

   alias dff='df -h -x tmpfs -x devtmpfs'

And it would be very easy and least disruptive to recommend
to ubuntu users to add "-x squashfs" or another file system to ignore.


Perhaps we can come up with a recommended list of "lesser" file systems
to ignore (or conditions such as read-only file
systems) and add it as a new option, but please let's not make it the
default.



My two cents,
 - assaf






Re: md5sum and recursive traversal of dirs

2019-10-12 Thread Assaf Gordon

( adding bug-time@ )

Hello,

On 2019-10-10 11:29 a.m., Сергей Кузнецов wrote:
[...]

By the way, I wrote two new small programs: xchg (or swap, which name is
better?) And exst (exit status).
[...] The second program launches the
program indicated at startup and, after its completion, prints the output
status or the caught signal.


Somewhat related:

the GNU Time program can report both exit code and signal
in the following way:

  $ env time -f "Exit code: %x\n" [SOME PROGRAM that dies with segfault]
  Command terminated by signal 11
  Exit code: 0

However, for a long time I wanted to add a new output format
specifier to GNU time that will indicate whether
a program exited cleanly or with a signal
(and which exit code or signal).

Your message reminded me of that, and I hope to add something
like that in the near future.

It could be something like:

   %T   1 if program terminated by a signal, empty otherwise
   %S   signal number if program terminated by a signal, empty otherwise
   %X   exit code if program terminated normally,
        or empty if terminated by a signal

And could be used like so:

  time -f "Signaled: %T (signal number: %S)\nExit code: %X\n" [PROGRAM]
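
For comparison, the same three pieces of information can be obtained with Python's subprocess module. This is only a sketch of the proposed semantics, not GNU time code; the function names are made up, and the T/S/X keys simply mirror the tentative specifiers above. On POSIX systems subprocess reports death by signal N as a return code of -N.

```python
import signal
import subprocess
import sys


def describe_status(returncode):
    """Map a subprocess return code to the three proposed fields."""
    if returncode < 0:
        # Negative return code: the child was terminated by a signal.
        return {"T": "1", "S": str(-returncode), "X": ""}
    return {"T": "", "S": "", "X": str(returncode)}


def run_and_describe(argv):
    """Run a command and describe how it terminated."""
    return describe_status(subprocess.run(argv).returncode)


if __name__ == "__main__":
    print(run_and_describe([sys.executable, "-c", "import sys; sys.exit(3)"]))
    print(describe_status(-int(signal.SIGPIPE)))
```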


Please send comments and suggestions to bug-t...@gnu.org .

regards,
 - assaf

P.S.
Note that your shell likely has its own built-in 'time' function.
To use GNU time run "env time" or "\time" .



Re: Is natural sort supported?

2019-10-08 Thread Assaf Gordon
(please use "reply-all" or "reply-group" to keep the coreutils@ mailing 
list in the loop)


On 2019-10-08 1:09 a.m., Peng Yu wrote:

Then, the option name causes misunderstanding. -V is actually
--debian-version.


Or simply "--version-sort" as it is now.


The natural order is
plain and simple, just as what is explained below, which can be
implemented by a few lines of python code.


At the risk of arguing over semantics,
I'll say again: there is no "one correct" natural order standard,
and therefore it is not "plain and simple" because there is no just 
"one" such order.


It can certainly be that there are some specific implementations
of 'natural sort' that are simple.


https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

So my question is whether natural order as in the above URL is supported?


No.

and note that even the above blog writes:
"... Don't let Ned's clever Python ten-liner fool you. Implementing a 
natural sort is more complex than it seems ... ".




Re: Is natural sort supported?

2019-10-08 Thread Assaf Gordon

Hello,

On 2019-10-08 12:36 a.m., Peng Yu wrote:

The following example shows that version sort is not natural sort. Is
natural sort supported by `sort`?


There is no such thing as "THE correct natural sort" order...


$ printf '%s\n' 1G13 1.02 | LC_ALL=C sort -k 1,1V # The result order
should have been reversed.


... therefore "should have" is simply an incorrect expectation.

You might think it "should" be one way, and other implementations
think it "should" be another way.

For more details, please see the attached HTML file.

(this HTML file is a new chapter of the coreutils manual that will be
included in the next release. The source texinfo is here:
https://git.savannah.gnu.org/cgit/coreutils.git/tree/doc/sort-version.texi 
).


regards,
 - assaf


1 Version sort ordering

   • [2]Version sort overview:
   • [3]Implementation Details:
   • [4]Differences from the official Debian Algorithm:
   • [5]Advanced Topics:

  1.1 Version sort overview

   version sort ordering (and similarly, natural sort ordering) is a
   method to sort items such as file names and lines of text in an order
   that feels more natural to people, when the text contains a mixture of
   letters and digits.

   Standard sorting usually does not produce the order that one expects
   because comparisons are made on a character-by-character basis.

   Compare the sorting of the following items:

Alphabetical sort:   Version Sort:

a1                   a1
a120                 a2
a13                  a13
a2                   a120

   version sort functionality in GNU coreutils is available in the ‘ls
   -v’, ‘ls --sort=version’, ‘sort -V’, ‘sort --version-sort’ commands.
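
   As an illustration only, the ordering in the small table above can be
   reproduced with a digit-aware sort key. The sketch below is written in
   Python and is not the algorithm used by coreutils; the function name is
   made up, and other inputs may well sort differently from ‘sort -V’.

```python
import re


def digit_aware_key(text):
    """Split text into digit and non-digit runs; compare digit runs as numbers."""
    key = []
    # With a capturing group, re.split() alternates non-digit and digit runs,
    # so the odd positions always hold the digit runs.
    for position, part in enumerate(re.split(r"([0-9]+)", text)):
        if part == "":
            continue
        if position % 2 == 1:
            key.append((1, int(part)))
        else:
            key.append((0, part))
    return key


if __name__ == "__main__":
    items = ["a1", "a120", "a13", "a2"]
    print(sorted(items))                       # character-by-character order
    print(sorted(items, key=digit_aware_key))  # digit runs compared as numbers
```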

   • [8]Using version sort in GNU coreutils:
   • [9]Origin of version sort and differences from natural sort:
   • [10]Correct/Incorrect ordering and Expected/Unexpected results:

1.1.1 Using version sort in GNU coreutils

   Two GNU coreutils programs use version sort: ls and sort.

   To list files in version sort order, use ls with -v or --sort=version
   options:

default sort:              version sort:

$ ls -1                    $ ls -1 -v
a1                         a1
a100                       a1.4
a1.13                      a1.13
a1.4                       a1.40
a1.40                      a2
a2                         a100

   To sort text files in version sort order, use sort with the -V option:

$ cat input
b3
b11
b1
b20


alphabetical order:        version sort order:

$ sort input               $ sort -V input
b1                         b1
b11                        b3
b20                        b11
b3                         b20

   To sort a specific column in a file use -k/--key with ‘V’ ordering
   option:

$ cat input2
1000  b3   apples
2000  b11  oranges
3000  b1   potatoes
4000  b20  bananas

$ sort -k2V,2 input2
3000  b1   potatoes
1000  b3   apples
2000  b11  oranges
4000  b20  bananas

1.1.2 Origin of version sort and differences from natural sort

   In GNU coreutils, the name version sort was chosen because it is based
   on Debian GNU/Linux’s algorithm of sorting packages’ versions.

   Its goal is to answer the question “which package is newer,
   firefox-60.7.2 or firefox-60.12.3 ?”

   In coreutils this algorithm was slightly modified to work on more
   general input such as textual strings and file names (see
   [16]Differences from the official Debian Algorithm).

   In other contexts, such as other programs and other programming
   languages, a similar sorting functionality is called [17]natural sort.

1.1.3 Correct/Incorrect ordering and Expected/Unexpected results

   Currently there is no standard for version/natural sort ordering.

   That is: there is no one correct way or universally agreed-upon way to
   order items. Each program and each programming language can decide its
   own ordering algorithm and call it ’natural sort’ (or other various
   names).

   See [20]Other version/natural sort implementations for many examples of
   differing sorting possibilities, each with its own rules and
   variations.

   If you do suspect a bug in coreutils’ implementation of version-sort,
   see [21]Reporting bugs or 

Re: The output from GNU Core Utilities dd is different in apline and ubuntu

2019-09-09 Thread Assaf Gordon

Hello,

On 2019-09-09 6:39 a.m., 薛帅 wrote:

In Ubuntu 18.04.1 LTS, the `dd` command outputs three lines.

[...]

While in Alpine 3.9.0, the `dd` command outputs only two lines.


Alpine linux does not use "coreutils" programs in the default
installation. Most of the equivalent programs are from busybox.

To see which implementation you are using,
try:

   # which dd
   /bin/dd
   # ls -l /bin/dd
   lrwxrwxrwx  1 root  root 12 Sep  9 13:22 /bin/dd -> /bin/busybox

Also,
coreutils' dd supports "--version":

  $ dd --version | head -n1
  dd (coreutils) 8.30

while busybox's dd will show its version in the help/usage screen
(which is shown when unsupported option "--version" is used):

   # dd --version 2>&1 | head -n1
   BusyBox v1.29.3 (2019-01-24 07:45:07 UTC) multi-call binary.
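
The "which" and "ls -l" steps can also be combined in a script. A minimal sketch in Python using only the standard library; the function name is made up for the example:

```python
import os
import shutil


def resolve_command(name):
    """Find a command on PATH and follow symlinks to the real binary.

    Returns None when the command is not found.
    """
    path = shutil.which(name)
    if path is None:
        return None
    return os.path.realpath(path)


if __name__ == "__main__":
    # On a busybox-based system this is expected to point at the busybox binary.
    print(resolve_command("dd"))
```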


regards,
 - assaf




bug#37093: wc runs 100% cpu when in pipeline or tee >(wc)

2019-08-20 Thread Assaf Gordon

tag 37093 notabug
close 37093
stop

Hello,

On 2019-08-19 10:44 p.m., Edward Huff wrote:

In the demo below, dd uses 0.665s to write 1GiB of zeros.
sha256sum uses 4.285s to calculate the sha256 of 1GiB of zeros.
wc uses 32.160s to count 1GiB of zeros.


[...]


baseline results:
$ dd if=/dev/zero count=$((1024*1024)) bs=1024 | tee >(sha256sum>&2) | wc
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.5007 s, 33.0 MB/s
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
   0   0 1073741824
$


First,
Try to avoid UTF8 locales (i.e., force a C/POSIX locale with LC_ALL=C)
which makes 'wc' much faster.

On my computer:

With UTF8 locale:

  $ dd if=/dev/zero count=$((1024*1024)) bs=1024 \
| tee >(sha256sum>&2) | time --portability wc
  1048576+0 records in
  1048576+0 records out
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 46.5928 s, 23.0 MB/s
  49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
0   0 1073741824
  real 46.59
  user 46.37
  sys 0.19

With C locale:

  $ dd if=/dev/zero count=$((1024*1024)) bs=1024 \
   | tee >(sha256sum>&2) | LC_ALL=C time --portability wc
  1048576+0 records in
  1048576+0 records out
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.60285 s, 125 MB/s
  49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
0   0 1073741824
  real 8.60
  user 5.22
  sys 0.26


Second,
The "word counting" feature in 'wc' is the main cpu-hog.
If you avoid that (i.e. counting only lines, or only characters),
'wc' is even faster (and it automatically ignores UTF8 issues):

  $ dd if=/dev/zero count=$((1024*1024)) bs=1024 \
   | tee >(sha256sum>&2) \
   | \time --portability wc -c
  1048576+0 records in
  1048576+0 records out
  1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.59429 s, 141 MB/s
  49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -
  1073741824

  real 7.59
  user 0.10
  sys 0.71

Notice that the "real time" didn't change much (from 8.6s to 7.59s),
but the actual work performed by 'wc' (measured in "user time") is down
drastically.


Third,
If you are comfortable with compiling Coreutils from source,
you can build it using optimized hashing function from OpenSSL, like so:

 ./configure --with-openssl
 make

Then, "sha256sum" will be faster (about 2x faster on my computer).

If you don't want to re-compile it, consider using "openssl" directly
to calculate the checksum, like so:

  dd if=/dev/zero count=1K bs=1M | tee >(openssl sha256>&2) | wc -c


Fourth,
To save a few more microseconds, consider using dd with larger block size 
(bs=) and fewer blocks (count=), e.g.:


   $ time dd if=/dev/zero of=/dev/null count=1M bs=1K
   1048576+0 records in
   1048576+0 records out
   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.865853 s, 1.2 GB/s

   real 0m0.868s
   user 0m0.288s
   sys  0m0.579s

   $ time dd if=/dev/zero of=/dev/null count=1K bs=1M
   1024+0 records in
   1024+0 records out
   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.0998688 s, 10.8 GB/s

   real 0m0.102s
   user 0m0.000s
   sys  0m0.102s

This won't reduce the total time by much, but will result in
fewer sys-calls, and less CPU kernel time (at least by a tiny bit).
The effect is more noticeable when reading or writing to a physical disk.
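
The difference in the number of calls is simple arithmetic. A small sketch in Python, assuming each block is transferred with one full read and one full write (the function name is made up):

```python
def blocks_needed(total_bytes, block_size):
    """Number of blocks (and thus read/write pairs) to move total_bytes."""
    return -(-total_bytes // block_size)  # integer ceiling division


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(blocks_needed(one_gib, 1024))       # bs=1K
    print(blocks_needed(one_gib, 1024 ** 2))  # bs=1M
```

The two results correspond to the "records in" counts printed by dd in the runs above.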



Lastly,
If you use GNU time instead of the shell's built-in 'time' function,
you can specify custom output format,
and easily show the timing of each program in the pipeline.
Example:

$ FMT="\n=== CMD: %C ===\nreal %e\tuser %U\tsys %S\n"
$ \time -f "$FMT" dd if=/dev/zero count=1M bs=1K \
 | \time -f "$FMT" tee >(\time -f "$FMT" sha256sum>&2) \
 | \time -f "$FMT" wc -c
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.77339 s, 138 MB/s

=== CMD: dd if=/dev/zero count=1048576 bs=1024 ===
real 7.77   user 0.36   sys 1.65


=== CMD: tee /dev/fd/63 ===
real 7.77   user 0.10   sys 1.30

49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14  -

=== CMD: sha256sum ===
real 7.77   user 7.47   sys 0.27

1073741824

=== CMD: wc -c ===
real 7.77   user 0.05   sys 0.76


As such, I'm closing this as "not a bug",
but discussion can continue by replying to this thread.

regards,
 - assaf






bug#37058: Error message with local deployment of Galaxy-k8s

2019-08-16 Thread Assaf Gordon

tag 37058 notabug
close 37058
stop

Hello,

Two issues are mixed here.

First:

On 2019-08-16 2:17 p.m., Gao, Jianliang wrote:
I followed https://github.com/phnmnl/phenomenal-h2020/wiki/QuickStart-Installation-for-Local-PhenoMeNal-Workflow with Older Galaxy chart to deploy local galaxy-k8s instance with minikube on Windows 10. The following message came from the logs of my pod. I can't connect to my local instance. 


[...]

kubectl logs galaxy-k8s-tr6fc
[ run_galaxy_config.sh ] -- Galaxy sqlite directory created since we are not 
using postgresql
[ run_galaxy_config.sh ] -- Replaced galaxy ini for the user's injected one

[...]

dpkg-preconfigure: unable to re-open stdin:
  [WARNING]: It is unneccessary to use '{{' in loops, leave variables in loop
expressions bare.

[...]

galaxy.tools.deps WARNING 2019-08-16 19:20:48,175 Path 
'./database/dependencies' does not exist, ignoring
galaxy.tools.deps WARNING 2019-08-16 19:20:48,175 Path 
'./database/dependencies' is not directory, ignoring
galaxy.tools.deps.installable WARNING 2019-08-16 19:20:48,190 Conda not 
installed and auto-installation disabled.
galaxy.tools.deps.installable WARNING 2019-08-16 19:20:48,190 Conda not 
installed and auto-installation disabled.


These are issues related to your Galaxy setup.

(for other readers: "Galaxy" in this context is a web-based framework 
for bioinformatics analysis, see https://galaxyproject.org/ and 
https://usegalaxy.org ).


Such questions are best asked in their support forums:
   https://galaxyproject.org/support/
   https://help.galaxyproject.org

This includes problems in underlying layers, such as the 'dpkg' errors
above that result from deploying Galaxy VMs or instances or kubernetes 
or containers etc.



tail: unrecognized file system type 0x794c7630 for 'paster.log'. please report 
this to bug-coreutils@gnu.org. reverting to polling


This warning does indeed come from the coreutils program 'tail',
however it is harmless in your situation.
For more details, see here:
  https://www.gnu.org/software/coreutils/filesystems.html

---

A cursory look at the error logs makes it seem like
"bug-coreutils@gnu.org" is the place to ask general questions about
the "Galaxy" server (because it is the last thing mentioned),
but that is not the case.
We can only help with coreutils programs (e.g. 'tail').

Please contact the Galaxy team for galaxy-related issues.

Hope this helps.
regards,
 - assaf







Re: building old coreutils versions on new glibc systems

2019-08-15 Thread Assaf Gordon

On 2019-08-13 11:45 p.m., Bernhard Voelker wrote:

On 8/13/19 8:10 PM, Bernhard Voelker wrote:

I'd only like to see following additional changes:

- make the script callable from an arbitrary directory, i.e.,
   make the file name of the patches relative to the script, and

- mention to adjust MANPATH (because that also works with the
   common directory: 'man df-8.23').

WDYT?


... and we need to make
   sc_prohibit_tab_based_indentation and
   sc_long_lines
happy again.

(BTW: The latter would alternatively be fixed if the patches would
be named *.diff instead of *.patch.)



Thanks again for the review and improvements.

Pushed here with your suggestions (+ very minor script changes):
  http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=b8609c7cf

-assaf


P.S.
We can still add a new web page (echoing the README.older-versions)
to the coreutils website, to ease finding this information via a search 
engine - WDYT ?





Re: [Implemented] [coreutils] Partial UTF-8 support for "cut -c"

2019-08-12 Thread Assaf Gordon
Hello,

On Mon, Aug 12, 2019 at 09:19:54PM +0200, jaime.mosqu...@tutanota.com wrote:
> I have partially implemented the option "-c" ("--characters") for UTF-8
> non-ASCII characters[...]

First and foremost,
Thank you for taking the time and effort to develop new features and
send them to the mailing list.

> This implementation has two, somewhat important shortcomings:
>
> * Other encodings are not implemented.
> [...] I decided to stick with just UTF-8.

At this point in time, this limitation is a show-stopper.
A multibyte-aware implementation for GNU coreutils (for all programs,
not just for 'cut') should support all native encodings.

Ostensibly, this should be implemented using the standard
mbrtowc(3)/mbstowcs(3) family of functions - but in reality there is
another complication - a good implementation should also support
systems where 'wchar_t' is limited to 16 bits (instead of 32 bits),
and therefore requires handling of Unicode surrogate pairs.
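
To illustrate the surrogate-pair complication, here is a minimal sketch
(plain shell arithmetic, not code from any patch) of how a code point
above U+FFFF is split into two 16-bit units on a system whose wide
characters hold only 16 bits. The code point U+1F3FB is used purely as
an example, and the variable names are made up for this illustration:

```shell
# Split a code point above U+FFFF into a UTF-16 surrogate pair.
cp=$((0x1F3FB))               # example code point (does not fit in 16 bits)
v=$((cp - 0x10000))           # 20-bit value to be split into two halves
hi=$((0xD800 + (v >> 10)))    # high (leading) surrogate: top 10 bits
lo=$((0xDC00 + (v & 0x3FF)))  # low (trailing) surrogate: bottom 10 bits

pair=$(printf '%04X %04X' "$hi" "$lo")
echo "U+$(printf '%X' "$cp") -> $pair"   # prints: U+1F3FB -> D83C DFFB
```

A program that counts or cuts "characters" on such a system has to treat
the two units as one character, which is the extra handling mentioned
above.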

You can read more about the problems (and past suggested solutions) here:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html

(as a side note to other readers: if these are no longer show-stopper
requirements, please chime in - this will make things much
easier.)

> * Modifier characters are treated as individual characters [...]
> Decisively, many languages from Western Europe (Spanish,
> Portuguese...) might or might not work with this program, depending on
> which kind of accented letters are produced [...]

I see two related but separate issues here.

The first is generally called "unicode normalization", e.g.
if the user sees the letter "A" with acute accent, is it encoded as one
unicode character (U+00C1, "Latin Capital Letter A with Acute")
or two unicode characters ("A" followed by U+0301 "Combining Acute
Accent").

This issue is not a problem (in the sense that it's OK if cut treats
"A" followed by U+0301 as separate characters) - because we will also
include an additional program that can convert from one form to the
other (called "unorm" in the URL mentioned above).
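
A small sketch of the difference between the two forms, assuming UTF-8
input (the helper names 'count_bytes' and 'count_points' are invented for
this example). It counts code points by dropping UTF-8 continuation
bytes, which is only valid for well-formed UTF-8:

```shell
# U+00C1 (precomposed) is C3 81 in UTF-8; "A" + U+0301 is 41 CC 81.
precomposed=$(printf '\303\201')
decomposed=$(printf 'A\314\201')

# Count bytes, and count code points by deleting UTF-8 continuation
# bytes (0x80-0xBF) so that exactly one byte per code point remains.
count_bytes()  { printf '%s' "$1" | wc -c | tr -d ' '; }
count_points() { printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '; }

echo "precomposed: $(count_bytes "$precomposed") bytes, $(count_points "$precomposed") code point(s)"
echo "decomposed:  $(count_bytes "$decomposed") bytes, $(count_points "$decomposed") code point(s)"
```

Both strings are displayed as the same letter, but a character-based
'cut' would see one character in the first and two in the second.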

The second interesting issue is the (new?) modifiers such as
the U+1F3FB "EMOJI MODIFIER FITZPATRICK"
(http://unicode.org/reports/tr51/#Diversity
https://codepoints.net/U+1F3FB)
that affect other characters.
Here I don't see an easy way to know if characters should be grouped,
and they should probably be treated as separate characters in all cases.


> On the other hand, missing bytes in a multibyte UTF-8 characters are 
> correctly handled
[...]
> It is my hope that you should find this first approach to the problem 
> sufficient for most uses, and incorporate it into the mainstream code.

I would say that your approach of dealing only with UTF-8 has some merits
(i.e., as a "fast path" in parallel to the slower mbrtowc(3) path
and the faster unibyte path).
I suspect that if we do go down that road, it'll be better to use
gnulib's already implemented UTF-8 code (and also UTF-16/UTF-32) instead
of adding ad-hoc functions.

> (Should my modifications be big enough to require it for copyright
> reasons, my name is "Jaime Mosquera", and I obviously agree to the
> terms of the GNU GPL.)

Thank you - that is indeed the gist (copyright assignment is needed from
contributors), but the technicalities are slightly different.

We ask that contributors fill out and send the following form:
  https://git.savannah.gnu.org/cgit/gnulib.git/tree/doc/Copyright/request-assign.future
The reasons are explained here: https://www.gnu.org/licenses/why-assign.en.html

regards,
 - assaf



Re: for the next gnulib update

2019-08-12 Thread Assaf Gordon
On Mon, Aug 12, 2019 at 05:55:55PM +0200, Bernhard Voelker wrote:
> On 8/12/19 5:50 AM, Assaf Gordon wrote:
> > Updated patch (fixed typo in commit message).
>
> +1 thanks

thanks, pushed here:
https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=a3d070fa3269e89dfad49fde8ea30758afa36f4b



Re: for the next gnulib update

2019-08-11 Thread Assaf Gordon
On Sun, Aug 11, 2019 at 09:33:47PM -0600, Assaf Gordon wrote:
> Hello,
> 
> On Sun, Aug 11, 2019 at 10:42:49AM +0200, Bruno Haible wrote:
> > A couple of changes in gnulib on 2019-07-15 [1] need updates on the 
> > coreutils
> > side, the next you update the gnulib used by coreutils.
> 
> Thanks for the heads-up.
> 
> Patch attached - I'll apply it tomorrow if there are no further comments.
>

Updated patch (fixed typo in commit message).


From fc120af40548e63a98644f9f075710259a00 Mon Sep 17 00:00:00 2001
From: Bruno Haible 
Date: Sun, 11 Aug 2019 21:29:00 -0600
Subject: [PATCH] build: adjust for recent gnulib pthread changes

Discussed in https://lists.gnu.org/r/coreutils/2019-08/msg00030.html .

* bootstrap.conf (gnulib_modules): Replace 'pthread' with
pthread-* modules.
* src/sort.c: Remove GNULIB_defined_pthread_functions conditional.
---
 bootstrap.conf | 5 -
 src/sort.c | 5 -
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/bootstrap.conf b/bootstrap.conf
index 49261524a..018bc4eb3 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -196,7 +196,10 @@ gnulib_modules="
   priv-set
   progname
   propername
-  pthread
+  pthread-cond
+  pthread-mutex
+  pthread-thread
+  pthread_sigmask
   putenv
   quote
   quotearg
diff --git a/src/sort.c b/src/sort.c
index d812aa999..360a1f140 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -82,11 +82,6 @@ struct rlimit { size_t rlim_cur; };
 # endif
 #endif
 
-#if GNULIB_defined_pthread_functions
-# undef pthread_sigmask
-# define pthread_sigmask(how, set, oset) sigprocmask (how, set, oset)
-#endif
-
 #if !defined OPEN_MAX && defined NR_OPEN
 # define OPEN_MAX NR_OPEN
 #endif
-- 
2.20.1



Re: for the next gnulib update

2019-08-11 Thread Assaf Gordon
Hello,

On Sun, Aug 11, 2019 at 10:42:49AM +0200, Bruno Haible wrote:
> A couple of changes in gnulib on 2019-07-15 [1] need updates on the coreutils
> side, the next you update the gnulib used by coreutils.

Thanks for the heads-up.

Patch attached - I'll apply it tomorrow if there are no further comments.

-assaf

From fc120af40548e63a98644f9f075710259a00 Mon Sep 17 00:00:00 2001
From: Bruno Haible 
Date: Sun, 11 Aug 2019 21:29:00 -0600
Subject: [PATCH] build: adjust for recent gnulib pthread changes

Discussed in https://lists.gnu.org/r/coreutils/2019-08/msg00030.html .

* bootstrap.conf (gnulib_modules): Replace 'pthread' with pthread-X
moduels.
* src/sort.c: Remove GNULIB_defined_pthread_functions conditional.
---
 bootstrap.conf | 5 -
 src/sort.c | 5 -
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/bootstrap.conf b/bootstrap.conf
index 49261524a..018bc4eb3 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -196,7 +196,10 @@ gnulib_modules="
   priv-set
   progname
   propername
-  pthread
+  pthread-cond
+  pthread-mutex
+  pthread-thread
+  pthread_sigmask
   putenv
   quote
   quotearg
diff --git a/src/sort.c b/src/sort.c
index d812aa999..360a1f140 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -82,11 +82,6 @@ struct rlimit { size_t rlim_cur; };
 # endif
 #endif
 
-#if GNULIB_defined_pthread_functions
-# undef pthread_sigmask
-# define pthread_sigmask(how, set, oset) sigprocmask (how, set, oset)
-#endif
-
 #if !defined OPEN_MAX && defined NR_OPEN
 # define OPEN_MAX NR_OPEN
 #endif
-- 
2.20.1



Re: parse-datetime.y - Military Timezones are inverted from the correct sense

2019-08-11 Thread Assaf Gordon

On 2019-08-10 9:17 p.m., Assaf Gordon wrote:

On Sat, Aug 10, 2019 at 01:05:23PM -0700, Paul Eggert wrote:

The attached patch-set includes this fix,
and the updated NEWS wording.
(I'll wait until gnulib is updated with the additional fix,
then create a new coreutil patch with the latest gnulib.)


Thanks here too; it all sounds good.


Attached latest version (with updated gnulib, and Bernhard's
syntax-check fix).

I'll push tomorrow unless other issues pop up.

-assaf



Pushed here:
https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=725c8d6bed902a181da867a5d38efd01f62d8c9a



Re: building old coreutils versions on new glibc systems

2019-08-10 Thread Assaf Gordon
Hello,

On Sat, Aug 10, 2019 at 03:19:57PM +0200, Bernhard Voelker wrote:
> On 8/7/19 6:04 PM, Jim Meyering wrote:
> > Since it is something that may contribute to binaries I build (with
> > the handy related build target), it feels like it belongs in
> > version-control

> okay, fine.  Both variants have advantages and disadvantages, so let's
> go for the variant easier to maintain.
>
> I'll reply wrt/ the patch in the other email.

Thanks for the improved script, the suggestion and the testing.

Attached is the updated patch. Changes are:
- moved to ./script/build-older-versions
- trimmed whitespace from patches
- used your version of the script
- added permissive license to the script
- added a short blurb at the end of the script, showing PATHs.

Comments welcomed,
 -assaf


0001-scripts-document-how-to-build-older-versions-on-newe.patch.gz
Description: application/gunzip


Re: parse-datetime.y - Military Timezones are inverted from the correct sense

2019-08-10 Thread Assaf Gordon
On Sat, Aug 10, 2019 at 01:05:23PM -0700, Paul Eggert wrote:
> > The attached patch-set includes this fix,
> > and the updated NEWS wording.
> > (I'll wait until gnulib is updated with the additional fix,
> > then create a new coreutil patch with the latest gnulib.)
>
> Thanks here too; it all sounds good.

Attached latest version (with updated gnulib, and Bernhard's
syntax-check fix).

I'll push tomorrow unless other issues pop up.

-assaf

From 961d668eea9c94beddd309d81f65c32a133a3260 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Fri, 9 Aug 2019 19:51:42 -0600
Subject: [PATCH 1/3] gnulib: update to latest

---
 gnulib | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gnulib b/gnulib
index c7d0b4506..8524167df 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit c7d0b4506574887be5835ae9ae892d365afbb98c
+Subproject commit 8524167df6555c38079e9d041044dc59a9ddbeee
-- 
2.20.1


From 3cbddd58fde11c911134bdfe79fc3f2579ba58e1 Mon Sep 17 00:00:00 2001
From: Bernhard Voelker 
Date: Mon, 22 Jul 2019 08:53:28 +0200
Subject: [PATCH 2/3] maint: add lib/argmatch.h to po/POTFILES.in

* po/POTFILES.in (lib/argmatch.h): Add to avoid sc_po_check error:
"maint.mk: you have changed the set of files with translatable \
 diagnostics;"
---
 po/POTFILES.in | 1 +
 1 file changed, 1 insertion(+)

diff --git a/po/POTFILES.in b/po/POTFILES.in
index 60c5124ac..4231f56c4 100644
--- a/po/POTFILES.in
+++ b/po/POTFILES.in
@@ -3,6 +3,7 @@
 
 # These are nominally temporary...
 lib/argmatch.c
+lib/argmatch.h
 lib/closein.c
 lib/closeout.c
 lib/copy-acl.c
-- 
2.20.1


From 725c8d6bed902a181da867a5d38efd01f62d8c9a Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Fri, 9 Aug 2019 20:16:06 -0600
Subject: [PATCH 3/3] date: mention military timezone changes from gnulib

Gnulib commits f1f10d47be8762e4ca17c8957a0520b08d28abfb and
0673d8ab42c9bb0cf618a21b537cdd8fb976fb73 negated the meaning of
military timezones parsed in gnu date.
See https://lists.gnu.org/r/bug-gnulib/2019-08/msg5.html and
https://lists.gnu.org/r/coreutils/2019-08/msg00021.html

* NEWS: Mention this user-visible change.
* tests/misc/date.pl: Add tests for the new behavior.
---
 NEWS   | 13 +
 tests/misc/date.pl | 10 ++
 2 files changed, 23 insertions(+)

diff --git a/NEWS b/NEWS
index 97c9d18bd..6719e504d 100644
--- a/NEWS
+++ b/NEWS
@@ -49,6 +49,19 @@ GNU coreutils NEWS -*- outline -*-
   coherency of file system attributes, useful on network file systems.
 
 
+** Changes in behavior
+
+  date now parses military time zones in accordance with common usage:
+"A" to "M"  are equivalent to UTC+1 to UTC+12
+"N" to "Y"  are equivalent to UTC-1 to UTC-12
+"Z" is "zulu" time (UTC).
+  For example, 'date -d "09:00B" is now equivalent to 9am in UTC+2 time zone.
+  Previously, military time zones were parsed according to the obsolete
+  rfc822, with their value negated (e.g., "B" was equivalent to UTC-2).
+  [The old behavior was introduced in sh-utils 2.0.15 ca. 1999, predating
+  coreutils package.]
+
+
 * Noteworthy changes in release 8.31 (2019-03-10) [stable]
 
 ** Bug fixes
diff --git a/tests/misc/date.pl b/tests/misc/date.pl
index 9ba3d3983..92755b1f2 100755
--- a/tests/misc/date.pl
+++ b/tests/misc/date.pl
@@ -300,6 +300,16 @@ my @Tests =
 
  # https://bugs.gnu.org/34608
  ['date-century-plus', '-d @0 +.%+4C.', {OUT => '.+019.'}],
+
+
+ # Military time zones, new behavior (since 8.32)
+ # https://lists.gnu.org/r/bug-gnulib/2019-08/msg5.html
+ ['mtz1', '-u -d "09:00B" +%T', {OUT => '07:00:00'}],
+ ['mtz2', '-u -d "09:00L" +%T', {OUT => '22:00:00'}],
+ ['mtz3', '-u -d "09:00N" +%T', {OUT => '10:00:00'}],
+ ['mtz4', '-u -d "09:00T" +%T', {OUT => '16:00:00'}],
+ ['mtz5', '-u -d "09:00X" +%T', {OUT => '20:00:00'}],
+ ['mtz6', '-u -d "09:00Z" +%T', {OUT => '09:00:00'}],
 );
 
 # Repeat the cross-dst test, using Jan 1, 2005 and every interval from 1..364.
-- 
2.20.1



Re: parse-datetime.y - Military Timezones are inverted from the correct sense

2019-08-10 Thread Assaf Gordon
Hello,

(adding bug-gnulib again :) )

Thank you both for the review and suggestions.

On 2019-08-10 1:46 a.m., Paul Eggert wrote:
> Assaf Gordon wrote:
>> I suggest the attached patch for coreutils.
>
> OK, except I'd remove "in accordance with rfc5322" since RFC 5322
> recommends treating all these zones as if they were UTC. Also, "T"
> continues to have its military meaning (i.e., between "S" and "U") if
> it's used properly.

Good point about 'T'.
After adding an additional test for it, I realized the gnulib fix wasn't
complete because it didn't negate the 'T' value from UTC+7 to UTC-7.
Attached is a suggested follow-up patch for gnulib.
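
For readers following along, here is a small sketch of the letter-to-offset
mapping as it stands once 'T' is negated like the other letters. This is
illustrative shell only, not the parse-datetime.y code; the function
names 'mil_offset' and 'mil_to_utc_hour' are made up for this example,
and only uppercase letters and plain integer hours are handled:

```shell
# Offset (in hours from UTC) of a military time zone letter, using the
# new, non-negated convention: A..M are UTC+1..+12, N..Y are UTC-1..-12.
mil_offset() {
  letters=ABCDEFGHIKLMNOPQRSTUVWXYZ        # 'J' is not a zone letter
  case $1 in ?) ;; *) return 1 ;; esac     # accept a single character only
  before=${letters%%"$1"*}                 # the letters preceding $1
  if [ "$before" = "$letters" ]; then return 1; fi   # not a zone letter
  idx=$(( ${#before} + 1 ))                # A=1 .. M=12, N=13 .. Y=24, Z=25
  if   [ "$idx" -le 12 ]; then echo "$idx"             # A..M: UTC+1 .. UTC+12
  elif [ "$idx" -le 24 ]; then echo "$(( 12 - idx ))"  # N..Y: UTC-1 .. UTC-12
  else echo 0                                          # Z: UTC
  fi
}

# Convert an hour given in a military zone to the corresponding UTC hour.
mil_to_utc_hour() {
  off=$(mil_offset "$2") || return 1
  echo $(( ($1 - off + 24) % 24 ))
}

mil_to_utc_hour 9 B    # prints 7
mil_to_utc_hour 9 T    # prints 16
```

With this mapping "09:00T" corresponds to 16:00 UTC, i.e. 'T' sits at
UTC-7 between 'S' (UTC-6) and 'U' (UTC-8).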

On Sat, Aug 10, 2019 at 05:40:30PM +0200, Bernhard Voelker wrote:
> On 8/10/19 4:26 AM, Assaf Gordon wrote:
> > This results in a user-visible change for gnu date,
> > I suggest the attached patch for coreutils.
> 
> The gnulib update requires the attached to calm down sc_po_check.
> You may squash that into your gnulib update commit (or leave it separate).

Good catch.
The attached patch-set includes this fix,
and the updated NEWS wording.
(I'll wait until gnulib is updated with the additional fix,
then create a new coreutils patch with the latest gnulib.)

regards,
 - assaf

From 9f464d51d8311f33340942c76e758454fa59042d Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Sat, 10 Aug 2019 13:17:49 -0600
Subject: [PATCH] parse-datetime: fix 'T' military timezone handling

follow-up to the previous commit: the 'T' case is handled outside the
conversion table (used as either military timezone UTC-7 or ISO8601
separator). Change it from "HOUR(7)" to "-HOUR(7)" to match other
timezone letters.

* lib/parse-datetime.y: Change 'T' value from UTC+7 to UTC-7.
* ChangeLog: Mention the change.
---
 ChangeLog| 8 
 lib/parse-datetime.y | 4 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 7616b5efd..7c25c53e5 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2019-08-10  Assaf Gordon 
+
+   parse-datetime: fix 'T' military timezone handling
+   follow-up to the previous commit: the 'T' case is handled outside the
+   conversion table (used as either military timezone UTC-7 or ISO8601
+   separator). Change it from "HOUR(7)" to "-HOUR(7)" to match other
+   timezone letters.
+
 2019-08-09  Paul Eggert  
 
parse-datetime: fix military timezone letters
diff --git a/lib/parse-datetime.y b/lib/parse-datetime.y
index d371b9cb1..218e3dc5b 100644
--- a/lib/parse-datetime.y
+++ b/lib/parse-datetime.y
@@ -754,14 +754,14 @@ zone:
 tZONE
   { pc->time_zone = $1; }
   | 'T'
-  { pc->time_zone = HOUR (7); }
+  { pc->time_zone = -HOUR (7); }
   | tZONE relunit_snumber
   { pc->time_zone = $1;
 if (! apply_relative_time (pc, $2, 1)) YYABORT;
 debug_print_relative_time (_("relative"), pc);
   }
   | 'T' relunit_snumber
-  { pc->time_zone = HOUR (7);
+  { pc->time_zone = -HOUR (7);
 if (! apply_relative_time (pc, $2, 1)) YYABORT;
     debug_print_relative_time (_("relative"), pc);
   }
-- 
2.20.1

From 19f7eab06af234641a2927514c03570c07a311db Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Fri, 9 Aug 2019 19:51:42 -0600
Subject: [PATCH 1/3] gnulib: update to latest

---
 gnulib | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gnulib b/gnulib
index c7d0b4506..f1f10d47b 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit c7d0b4506574887be5835ae9ae892d365afbb98c
+Subproject commit f1f10d47be8762e4ca17c8957a0520b08d28abfb
-- 
2.20.1


From 673a0360b9dd96cfe0df017febb40980843cbb84 Mon Sep 17 00:00:00 2001
From: Bernhard Voelker 
Date: Mon, 22 Jul 2019 08:53:28 +0200
Subject: [PATCH 2/3] maint: add lib/argmatch.h to po/POTFILES.in

* po/POTFILES.in (lib/argmatch.h): Add to avoid sc_po_check error:
"maint.mk: you have changed the set of files with translatable \
 diagnostics;"
---
 po/POTFILES.in | 1 +
 1 file changed, 1 insertion(+)

diff --git a/po/POTFILES.in b/po/POTFILES.in
index 60c5124ac..4231f56c4 100644
--- a/po/POTFILES.in
+++ b/po/POTFILES.in
@@ -3,6 +3,7 @@
 
 # These are nominally temporary...
 lib/argmatch.c
+lib/argmatch.h
 lib/closein.c
 lib/closeout.c
 lib/copy-acl.c
-- 
2.20.1


From 0c552fac1991f49ef2db347adaed7bd82b935d70 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Fri, 9 Aug 2019 20:16:06 -0600
Subject: [PATCH 3/3] date: mention military timezone changes from gnulib

Gnulib commit f1f10d47be8762e4ca17c8957a0520b08d28abfb (based on
https://lists.gnu.org/r/bug-gnulib/2019-08/msg5.html) negated the
meaning of military timezones parsed in gnu date.

* NEWS: Mention this user-visible change.
* tests/misc/date.pl: Add tests for the new behavior.
---
 NEWS   | 13 +
 t

Re: parse-datetime.y - Military Timezones are inverted from the correct sense

2019-08-09 Thread Assaf Gordon
Hello,

On Fri, Aug 09, 2019 at 02:01:35PM -0700, Paul Eggert wrote:
> Since the RFC 822 error was fixed in 2001 when RFC 2822 came out, it is long
> past time to fix parse-datetime.y accordingly, so I installed the attached
> patch into Gnulib.

This results in a user-visible change for gnu date,
I suggest the attached patch for coreutils.

-assaf

From 19f7eab06af234641a2927514c03570c07a311db Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Fri, 9 Aug 2019 19:51:42 -0600
Subject: [PATCH 1/2] gnulib: update to latest

---
 gnulib | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gnulib b/gnulib
index c7d0b4506..f1f10d47b 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit c7d0b4506574887be5835ae9ae892d365afbb98c
+Subproject commit f1f10d47be8762e4ca17c8957a0520b08d28abfb
-- 
2.20.1


From 6eb1118f00a7018f08f69c7ace86cd92f89ca961 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Fri, 9 Aug 2019 20:16:06 -0600
Subject: [PATCH 2/2] date: mention military timezone changes from gnulib

Gnulib commit f1f10d47be8762e4ca17c8957a0520b08d28abfb (based on
https://lists.gnu.org/r/bug-gnulib/2019-08/msg5.html) negated the
meaning of military timezones parsed in gnu date.

* NEWS: Mention this user-visible change.
* tests/misc/date.pl: Add tests for the new behavior.
---
 NEWS   | 16 
 tests/misc/date.pl |  9 +
 2 files changed, 25 insertions(+)

diff --git a/NEWS b/NEWS
index 97c9d18bd..d4904d20b 100644
--- a/NEWS
+++ b/NEWS
@@ -49,6 +49,22 @@ GNU coreutils NEWS -*- outline -*-
   coherency of file system attributes, useful on network file systems.
 
 
+** Changes in behavior
+
+  date now parses military time zones in accordance with rfc5322:
+"A" to "M"  are equivalent to UTC+1 to UTC+12
+"N" to "S"  are equivalent to UTC-1 to UTC-6
+"U" to "Y"  are equivalent to UTC-8 to UTC-12
+"T" is parsed as a ISO-8601 format representation,
+and should not be used for military time zones in gnu date.
+"Z" is "zulu" time (UTC).
+  For example, 'date -d "09:00B" is now equivalent to 9am in UTC+2 time zone.
+  Previously, military time zones were parsed according to the obsolete
+  rfc822, with their value negated (e.g., "B" was equivalent to UTC-2).
+  [The old behavior was introduced in sh-utils 2.0.15 ca. 1999, predating
+  coreutils package.]
+
+
 * Noteworthy changes in release 8.31 (2019-03-10) [stable]
 
 ** Bug fixes
diff --git a/tests/misc/date.pl b/tests/misc/date.pl
index 9ba3d3983..e11753347 100755
--- a/tests/misc/date.pl
+++ b/tests/misc/date.pl
@@ -300,6 +300,15 @@ my @Tests =
 
  # https://bugs.gnu.org/34608
  ['date-century-plus', '-d @0 +.%+4C.', {OUT => '.+019.'}],
+
+
+ # Military time zones, new behavior (since 8.32)
+ # https://lists.gnu.org/r/bug-gnulib/2019-08/msg5.html
+ ['mtz1', '-u -d "09:00B" +%T', {OUT => '07:00:00'}],
+ ['mtz2', '-u -d "09:00L" +%T', {OUT => '22:00:00'}],
+ ['mtz3', '-u -d "09:00N" +%T', {OUT => '10:00:00'}],
+ ['mtz4', '-u -d "09:00X" +%T', {OUT => '20:00:00'}],
+ ['mtz5', '-u -d "09:00Z" +%T', {OUT => '09:00:00'}],
 );
 
 # Repeat the cross-dst test, using Jan 1, 2005 and every interval from 1..364.
-- 
2.20.1



bug#36985: tail

2019-08-09 Thread Assaf Gordon

close 36985
stop

Hello,

On 2019-08-09 12:55 a.m., Rob Hearne wrote:

root@kafka-robh-vmdub-04:/kafka/bin# tail -f Control
tail: unrecognized file system type 0x794c7630 for ‘Control’. please report
this to bug-coreutils@gnu.org. reverting to polling



This has been fixed in version 8.25 (released in 2016).
For more details, see
https://www.gnu.org/software/coreutils/filesystems.html

-assaf





Re: building old coreutils versions on new glibc systems

2019-08-06 Thread Assaf Gordon
Hello,

On Tue, Aug 06, 2019 at 09:35:01PM +0200, Bernhard Voelker wrote:
> On 8/2/19 9:05 AM, Jim Meyering wrote:
> > Nice work. I've had to go through this process a few times over the
> > years, and having these handy patch files checked in and maintained
> > would make it easier to automate the process. 

> While this work is definitely worth keeping, I'm only 20:80 to add
> something to the current (and future) version which belongs to older
> versions.
> 
> What about either uploading it to the FTP, or even better to add it
> to the web pages' CVS?

Adding it as a page to the website sounds good (it will also be easy
for people to find using common search engines).
I don't like the FTP idea so much - not very accessible unless you
know exactly what you're looking for.

Attached is a possible HTML page (and the patches in a subdirectory).

Comments welcomed,
 - assaf


coreutils-website-older-versions.tar.gz
Description: application/tar-gz


bug#36901: Enhance directory and file moves where target already exists

2019-08-03 Thread Assaf Gordon
Hello,

On Fri, Aug 02, 2019 at 10:47:18PM -0700, L A Walsh wrote:
> It's not a wish list that 'mv' doesn't work as documented.

The "wishlist" refers to the topic:
You are asking to add new functionality to 'mv'.
That is a "wishlist" item.


(answering out of order:)

> > On 2019-08-02 9:56 p.m., L A Walsh wrote:
> >> But you say posix wants it to perform as a rename?
[...]
> >>
> >> So if I have:
> >> mkdir A B
> >> touch A/foo B/fee
> >> So when I look at the system call on linux for rename:
> >> oldpath can specify a directory.  In this case, newpath must
> >> either not
> >> exist, or it must specify an empty directory.
> >>  (complying with POSIX_C_SOURCE >= 200809L)
> >>
> >> So move should give an error: Nope:
> >>
> >> mv A B
> >>> tree B
> >> B
> >> ├── A
> >> │   └── foo
> >> └── fee
> >>
> >> 1 directory, 2 files
> >>
> >> So mv is violating POSIX - it didn't do the rename, but moved
> >> A under B and neither dir had to be empty.
> >>
> >> Saying it has to follow POSIX when it doesn't appear to, seems
> >> a bit contradictory?

I previously quoted one small part of the entire "mv" POSIX specification
(item #3, regarding using the 'rename(2)' function).

It would be wise to read the entire specification before making claims
about violating POSIX.
Specifically, at the top of the page:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/mv.html
   SYNOPSIS
  mv [-if] source_file target_file
  mv [-if] source_file... target_dir
   DESCRIPTION
  [...]
  In the second synopsis form, mv shall move each file named by a
  source_file operand to a destination file in the existing directory
  named by the target_dir operand [...] This second form is assumed
  when the final operand names an existing directory

In this regard GNU 'mv' is compliant with POSIX.
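
The quoted scenario can be reproduced safely in a scratch directory. This
is just a sketch of the second synopsis form; it assumes 'mktemp -d' is
available (it is not required by POSIX), and 'tmp' is a name chosen for
this example:

```shell
# Scratch directory, so the example does not touch real files.
tmp=$(mktemp -d)
mkdir "$tmp/A" "$tmp/B"
touch "$tmp/A/foo" "$tmp/B/fee"

# 'B' is an existing directory, so this is the second synopsis form:
# 'A' is moved *into* 'B' and becomes 'B/A'. Nothing is renamed onto 'B'.
mv "$tmp/A" "$tmp/B"

# List the result: B, B/A, B/A/foo and B/fee remain; A is gone.
find "$tmp" | sort
```

The rename(2) restriction about non-empty directories only comes into
play for the final destination 'B/A', which did not exist here.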


> > On 2019-08-02 9:56 p.m., L A Walsh wrote:
> >> On 2019/08/02 19:47, Assaf Gordon wrote:
> >>> Can new merging features be added to 'mv'? yes.
> >>> But it seems to me these would be better suited for 'higher level'
> >>> programs (e.g. a GUI file manager).
> >> ---
> >> If the command was named 'ren', then I'd expect it to be dummer,
> >> but 'mv'/move seem like it should be able to move files from
> >> one dir into another.
> >>
> >> But you say posix wants it to perform as a rename?
> >> I know, create a 're' command (or 'rn') for rename, and have
> >> it do what 'mv' would do.  Maybe posix would realize it would
> >> be better to have re/rn behave like rename, and 'mv' to
> >> behave it was moving something.

The Austin group (https://www.opengroup.org/austin/) who is in charge
of developing and maintaining the POSIX standard is the place
to go when wanting to change things in POSIX (or add new things).

You can write to them, suggest a modification,
and if they change the standard, GNU coreutils will surely follow.

As for renaming 'mv' or creating a new 'rn' command -
part of POSIX is to codify existing behavior (that is - programs which
were in common use *before* POSIX).  It's not always logical, it's not always
ideal, but that's what has been in use for many years.

Based on mv's wiki page (https://en.wikipedia.org/wiki/Mv), 'mv' was
first introduced in 1971, 48 years ago.
With hindsight of nearly 5 decades it's easy to point to faults in a
program. If we were designing 'mv' today from scratch, I'm sure we would
improve many of its aspects.

But given that it is a long-standing program and its usage and quirks
are well established, I'm inclined to say it is highly unlikely
we will change mv's default behaviour or replace it with a different
name.

Adding new functionality (e.g. a new '--merge-directory' option)
is possible, and concrete patches are always welcome.
However, given all the above, there is no guarantee that such a new option
will be accepted.
I still think that such specific features are better suited for more
sophisticated programs (whether GUI or command line).
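
In the meantime, a merge can be approximated with existing tools. The
following is only a sketch of one possible approach (copy the contents,
then remove the source), not a proposed 'mv' feature. It assumes
'mktemp -d' is available, and note that files with the same name in both
directories are silently overwritten:

```shell
# Scratch directory: A and B/A both exist and both contain a file.
tmp=$(mktemp -d)
mkdir "$tmp/A" "$tmp/B" "$tmp/B/A"
touch "$tmp/A/bar" "$tmp/B/A/foo"

# 'mv A B' would fail here with "Directory not empty".
# Instead: copy the contents of A into the existing B/A, then remove A
# (the removal only happens if the copy succeeded).
cp -R "$tmp/A/." "$tmp/B/A/" && rm -r "$tmp/A"

# B/A now contains both 'bar' and 'foo'.
find "$tmp" | sort
```

Unlike a real move, this copies the data, so it is slower and temporarily
needs extra disk space.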

regards,
 - assaf








bug#36901: Enhance directory and file moves where target already exists

2019-08-02 Thread Assaf Gordon
severity 36901 wishlist
retitle 36901 mv: merge directories where target already exists
stop

Hello,

(for context: this is a new topic, diverged at https://bugs.gnu.org/36831#38 )

For completeness, quoting your second message ( from 
https://bugs.gnu.org/36831#50 ):

On 2019-08-02 9:56 p.m., L A Walsh wrote:
> 
> On 2019/08/02 19:47, Assaf Gordon wrote:
>> Can new merging features be added to 'mv'? yes.
>> But it seems to me these would be better suited for 'higher level'
>> programs (e.g. a GUI file manager).
> ---
>   But neither the person who posted the original bug on this
> nor I are using a GUI, we are running 'mv' GUI, we use the cmd line on
> linux, so that wouldn't
> be of any use.
> 
> If the command was named 'ren', then I'd expect it to be dummer,
> but 'mv'/move seem like it should be able to move files from
> one dir into another.
> 
> But you say posix wants it to perform as a rename?
> I know, create a 're' command (or 'rn') for rename, and have
> it do what 'mv' would do.  Maybe posix would realize it would
> be better to have re/rn behave like rename, and 'mv' to
> behave it was moving something.
> 
> So if I have:
> mkdir A B
> touch A/foo B/fee
> 
> So when I look at the system call on linux for rename:
> oldpath can specify a directory.  In this case, newpath must
> either not
> exist, or it must specify an empty directory.
>  (complying with POSIX_C_SOURCE >= 200809L)
> 
> So move should give an error: Nope:
> 
> mv A B
>> tree B
> B
> ├── A
> │   └── foo
> └── fee
> 
> 1 directory, 2 files
> 
> So mv is violating POSIX - it didn't do the rename, but moved
> A under B and neither dir had to be empty.
> 
> Saying it has to follow POSIX when it doesn't appear to, seems
> a bit contradictory?
> 







bug#36831: Enhance directory move. (was Re: bug#36831: enhance 'directory not empty' message)

2019-08-02 Thread Assaf Gordon

Hello,

On 2019-08-02 9:56 p.m., L A Walsh wrote:
> On 2019/08/02 19:47, Assaf Gordon wrote:
>> Can new merging features be added to 'mv'? yes.
>> But it seems to me these would be better suited for 'higher level'
>> programs (e.g. a GUI file manager).
> ---
> But neither the person who posted the original bug on this
> nor I are using a GUI, we are running 'mv' GUI, we use the cmd line on linux,
> so that wouldn't
> be of any use.


The original post was about the error *message*, asking to make it
clearer. That is the topic of this thread (and the previous patch),
so let's leave it at that.


I see you started a new thread ( https://bugs.gnu.org/36901 ),
so I'll reply there.






bug#36831: Enhance directory move. (was Re: bug#36831: enhance 'directory not empty' message)

2019-08-02 Thread Assaf Gordon
Hello,

On Fri, Aug 02, 2019 at 02:41:31AM -0700, L A Walsh wrote:
> On 2019/07/28 23:28, Assaf Gordon wrote:
> >
> >
> > $ mkdir A B B/A
> > $ touch A/bar B/A/foo
> > $ mv A B
> > mv: cannot move 'A' to 'B/A': Directory not empty
> >
> > And the reason (as you've found out) is that the target directory 'B/A'
> > is not empty (has the 'foo' file in it).
> > Had this been allowed, moving 'A' to 'B/A' would result in the 'foo'
> > file disappearing.
> >   
> ---
> Why must foo disappear?
> 
> Microsoft Windows handles this situation by telling the user that
> the target directory already exists and giving the option to *MERGE*
> the directories.
> 
> If you attempt to move a file into a directory that already contains
> a file by the same name, it pops up another notice asking [...]

Certainly, GUI programs (and more 'feature-rich' programs than 'mv')
offer many "merging" options.

I'm sure Midnight-Commander, KDE/Dolphin, XFCE/Thunar, Gnome/Nautilus
and many other free software GUI file managers have some "merging"
capabilities.

But 'mv' is more basic and does not have this capability.
Partly that is because it adheres to the POSIX standard, which
mandates:
"3. The mv utility shall perform actions equivalent to the
rename() function [...]"
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/mv.html

Some rsync options (--remove-source-files) can mimic 'mv' with merging,
but then they are more like "copy+delete" than actual "rename/move".
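
To make the "copy+delete" point concrete, here is a minimal sketch of a
merge done outside of 'mv'. The helper name merge_move is hypothetical,
and it uses plain 'cp -a' and 'rm -r' rather than rsync, assuming a cp
that supports '-a' (GNU cp does). It is not atomic, and same-named files
in the destination are overwritten.

```shell
#!/bin/sh
# merge_move SRC DST: merge the contents of directory SRC into directory
# DST, then remove SRC.  Hypothetical helper for illustration only: this
# is copy+delete, not an atomic rename(2), and files with the same name
# in DST are overwritten.
merge_move() {
  src=$1
  dst=$2
  mkdir -p "$dst" || return 1
  # "SRC/." copies the directory's contents (including dot-files)
  # into DST instead of creating DST/SRC.
  cp -a "$src/." "$dst/" || return 1
  rm -r "$src"
}

# Usage, following the scenario from this thread, in a scratch directory:
work=$(mktemp -d) && cd "$work" || exit 1
mkdir A B B/A
touch A/bar B/A/foo
merge_move A B/A
ls B/A
```

After the call, B/A holds both 'bar' and 'foo' and the source directory
A is gone, which is the behaviour a plain rename cannot provide.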


Can new merging features be added to 'mv'? yes.
But it seems to me these would be better suited for 'higher level'
programs (e.g. a GUI file manager).

regards,
 - assaf





Re: building old coreutils versions on new glibc systems

2019-08-02 Thread Assaf Gordon
Hello,

On Fri, Aug 02, 2019 at 12:05:53AM -0700, Jim Meyering wrote:
> On Thu, Aug 1, 2019 at 7:48 PM Assaf Gordon  wrote:
> > The attached patches enable building old tarballs on modern systems
> > (tested on Debian 10 with GLIBC 2.28-10, gcc 8.3.0-6).
> >
> 
> Nice work. I've had to go through this process a few times over the
> years, and having these handy patch files checked in and maintained
> would make it easier to automate the process. I'm on the fence as to
> whether it's worth checking them in, given how few of us end up
> building all old versions like that. Selfishly, I want it. Now that I
> write this, I conclude it's worth the small cost. No need to
> distribute those files, of course, and anything that makes a
> maintainer's job easier (for such a small cost) is worthwhile.

Thanks.

Attached a patch to add these (+ README + build script)
to a new 'contrib' directory.

NOTE:
I had to disable the 'commit-msg' hook (for the 'contrib' prefix)
and the 'pre-commit' hook (because the patch files contain trailing
spaces and lines longer than 80 characters).

Not sure if this is valid, or will cause trouble later on.
An alternative might be to gzip the patches before committing?
Ideas welcomed.

-assaf



0001-contrib-document-how-to-build-older-versions-on-newe.patch.gz
Description: application/gunzip


Re: seq: fix bug of printing extra line

2019-08-02 Thread Assaf Gordon
On Fri, Aug 02, 2019 at 01:08:49PM +0100, Pádraig Brady wrote:
> On 02/08/19 03:28, Assaf Gordon wrote:
> > 
> > Prompted by the recent 'seq' thread, I spotted a bug in seq.
> > Fix attached.
> 
> Nice one. thanks!
> 

Thanks, pushed here:
https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=07f811a3c02d3d6dc1943030afccdfdcf7ac1e5e



Re: Add new command

2019-08-02 Thread Assaf Gordon
Hello,

On Fri, Aug 02, 2019 at 07:12:06PM +0430, Saeed Dehqan wrote:
> How do I add a command called rn?

In general,
please follow the instructions in the README-hacking
and HACKING files to prepare a patch for a new command.
https://git.savannah.gnu.org/cgit/coreutils.git/tree/README-hacking
https://git.savannah.gnu.org/cgit/coreutils.git/tree/HACKING

See past examples of such patches here:
 https://git.savannah.gnu.org/cgit/coreutils.git/log/?qt=grep&q=new+program
(Note all the files they modify.)

Since this will be a large contribution (i.e., more than 10 lines of
code), a copyright assignment will be required.
Please see here:
https://www.gnu.org/licenses/why-assign.en.html
Then fill and send this form:
https://git.savannah.gnu.org/cgit/gnulib.git/tree/doc/Copyright/request-assign.future

> This is an advanced command to rename large-scale files and accelerations.
> This command supported Regexes and Counters.

Before re-inventing the wheel, it is worth checking what other programs
exist for such functionality.
The basic "rename" program has existed for decades and allows regex renames.
Many other programs provide more advanced options (and even GUI),
like:
https://www.ostechnix.com/how-to-rename-multiple-files-at-once-in-linux/
https://packages.debian.org/search?keywords=rename

A command that replicates existing functionality is less likely to be
accepted in GNU coreutils.

regards,
 - assaf




bug#36831: enhance 'directory not empty' message

2019-08-01 Thread Assaf Gordon
On Thu, Aug 01, 2019 at 03:58:51PM -0700, Paul Eggert wrote:
> Thanks, that's better, but we're still missing some opportunities for 
> improvement.
> 
> > mv: cannot move 'A' to 'B/A': Target directory not empty
> 
> This should be "Destination" not "Target". 
[...] 
> You meant "mv" not "rm".
[...]
> > +static char*
> Space before "*".
[...]
> > +strerror_target (int e)
> Change name to "strerror_dest"
[...] 
> This function should return NULL instead of aborting when the errno value is
> inapplicable. That way, its callers need not hardcode which errno values it
> handles.

Thanks for the review and suggestions - attached an updated patch.

> Come to think of it, the same improvement should be made to ln, cp, install
> and shred. Basically, to any program that uses 'rename' or 'link' or similar
> syscalls, and which reports an error if the syscall fails.

OK, I will work on that next.

-assaf
From 8dc6158a6fde668e55312b5fb69384f438b7e55a Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Mon, 29 Jul 2019 00:23:20 -0600
Subject: [PATCH] mv: improve error messages when destination directory is at
 fault

Suggested by Alex Mantel in
https://bugs.gnu.org/36831 .

$ mkdir A B B/A
$ touch A/bar B/A/foo

Before:

$ mv A B
mv: cannot move 'A' to 'B/A': Directory not empty

After:

$ mv A B
mv: cannot move 'A' to 'B/A': Destination directory not empty

The following errors are handled:
EDQUOT, EEXIST, ENOTEMPTY, EISDIR, ENOSPC, ETXTBSY.

* src/copy.c (copy_internal): Print custom messages for errors
that explicitly fault the destination directory.
(strerror_dest): New function, return custom, translatable error
messages for errors relating to 'destination' component.
* tests/mv/dir2dir.sh: Adjust expected error message.
* NEWS: Mention change.
---
 NEWS                |  6 +
 src/copy.c          | 53 ++---
 tests/mv/dir2dir.sh |  8 ---
 3 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/NEWS b/NEWS
index fd0543351..3d80665ae 100644
--- a/NEWS
+++ b/NEWS
@@ -44,6 +44,12 @@ GNU coreutils NEWS                                    -*- outline -*-
   stat(1) also supports a new --cached= option to control cache
   coherency of file system attributes, useful on network file systems.
 
+** Improvements
+
+  mv now prints clearer error messages when a failure relates to the
+  destination directory (e.g., "Destination directory not empty" instead
+  of "Directory not empty").
+
 
 * Noteworthy changes in release 8.31 (2019-03-10) [stable]
 
diff --git a/src/copy.c b/src/copy.c
index 65cf65895..602c8307b 100644
--- a/src/copy.c
+++ b/src/copy.c
@@ -1867,6 +1867,44 @@ source_is_dst_backup (char const *srcbase, struct stat const *src_st,
   return dst_back_status == 0 && SAME_INODE (*src_st, dst_back_sb);
 }
 
+/* Return custom error messages replacing the default libc's
+   messages. These messages explicitly fault the destination component
+   in the error.
+
+   Return NULL if E (errno value) is not handled (and by implication
+   should use the system's default text for the error message).  */
+static char *
+strerror_dest (int e)
+{
+  /* TRANSLATORS: These strings should mimic libc's standard
+ error messages (from strerror(3)), but explicitly mention
+ the fault is with the destination directory. */
+  switch (e)
+{
+case EDQUOT:
+  return _("Disk quota exceeded on destination device");
+case EEXIST:
+case ENOTEMPTY:
+  return _("Destination directory not empty");
+case EISDIR:
+  return _("Tried to overwrite a directory with a file");
+case ENOSPC:
+  return _("No space left on destination device");
+case ETXTBSY:
+  /* NOTE: The error is "Text file busy" - but "text" in that context
+ refers to "text segment" of an executable file (as opposed to
+ "data segment" and "BSS segment").
+
+ This error message is meant for users, and 'text file' can be easily
+ confused with an actual text file (i.e., one containing only ASCII
+ characters).  Thus, say 'executable' instead of 'text'.  */
+  return _("Destination executable file is busy");
+default:
+  return NULL;
+}
+}
+
+
 /* Copy the file SRC_NAME to the file DST_NAME.  The files may be of
any type.  NEW_DST should be true if the file DST_NAME cannot
exist because its parent directory was just created; NEW_DST should
@@ -2477,9 +2515,18 @@ copy_internal (char const *src_name, char const *dst_name,
  If the permissions on the directory containing the source or
  destination file are made too restrictive, the rename will
  fail.  Etc.  */
-  error (0, rena

building old coreutils versions on new glibc systems

2019-08-01 Thread Assaf Gordon
Hello,

While trying to find out the first version with the 'seq' bug
(my previous email), I realized it has become quite hard to build
old coreutils versions on newer glibc systems.

In particular:
1. At some point 'gets' was removed from glibc, but old sources refer to it.
2. Older gnulib used internal glibc symbols (libio.h) and the detection
method changed (_IO_ftrylockfile vs _IO_EOF_SEEN).
See:  https://git.sv.gnu.org/cgit/gnulib.git/commit/?id=74d9d6a2
3. Old coreutils defined 'futimens','tee','eaccess' functions which conflict
with later glibc functions of the same name.

In short, it's not trivial to download a tarball from
https://ftp.gnu.org/gnu/coreutils/ and build it on modern systems
(and it seems even more complicated to build from git).

The attached patches enable building old tarballs on modern systems
(tested on Debian 10 with GLIBC 2.28-10, gcc 8.3.0-6).

The sequence should be:

wget https://ftp.gnu.org/gnu/coreutils/coreutils-5.97.tar.gz
tar -xf coreutils-5.97.tar.gz
cd coreutils-5.97
patch -p1 < ../coreutils-5.97-on-glibc-2.28.patch
./configure
make

Coreutils Versions   Patch file
5.0                  coreutils-5.0-on-glibc-2.28.patch
5.97 to 6.9          coreutils-5.97-on-glibc-2.28.patch
6.10                 coreutils-6.10-on-glibc-2.28.patch
6.11                 coreutils-6.11-on-glibc-2.28.patch
6.12                 coreutils-6.12-on-glibc-2.28.patch
7.2  to 8.3          coreutils-7.2-on-glibc-2.28.patch
8.4  to 8.12         coreutils-8.4-on-glibc-2.28.patch
8.13 to 8.16         coreutils-8.13-on-glibc-2.28.patch
8.17                 coreutils-8.17-on-glibc-2.28.patch
8.18 to 8.23         coreutils-8.18-on-glibc-2.28.patch
8.24 to 8.29         coreutils-8.24-on-glibc-2.28.patch
8.30 and newer       [builds without patching]
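
For scripting the sequence above, the table can be expressed as a small
lookup function. This is only a sketch: the name patch_for_version is
hypothetical, it looks at MAJOR.MINOR only, and versions that do not
appear in the table (e.g. 7.0 and 7.1) are reported as unknown rather
than guessed.

```shell
#!/bin/sh
# patch_for_version VERSION: print the patch file name that the table
# above lists for a coreutils version.  Prints nothing (status 0) for
# versions that build unpatched, and returns 1 for versions the table
# does not cover.  Hypothetical helper; only MAJOR.MINOR is examined.
patch_for_version() {
  ver=$1
  major=${ver%%.*}
  minor=${ver#*.}
  minor=${minor%%.*}
  # Fold MAJOR.MINOR into one integer so that 5.97 < 6.9 < 6.10
  # compare numerically instead of as strings.
  key=$((major * 1000 + minor))

  if   [ "$key" -eq 5000 ]; then base=5.0
  elif [ "$key" -ge 5097 ] && [ "$key" -le 6009 ]; then base=5.97
  elif [ "$key" -eq 6010 ]; then base=6.10
  elif [ "$key" -eq 6011 ]; then base=6.11
  elif [ "$key" -eq 6012 ]; then base=6.12
  elif [ "$key" -ge 7002 ] && [ "$key" -le 8003 ]; then base=7.2
  elif [ "$key" -ge 8004 ] && [ "$key" -le 8012 ]; then base=8.4
  elif [ "$key" -ge 8013 ] && [ "$key" -le 8016 ]; then base=8.13
  elif [ "$key" -eq 8017 ]; then base=8.17
  elif [ "$key" -ge 8018 ] && [ "$key" -le 8023 ]; then base=8.18
  elif [ "$key" -ge 8024 ] && [ "$key" -le 8029 ]; then base=8.24
  elif [ "$key" -ge 8030 ]; then return 0   # builds without patching
  else return 1                             # not covered by the table
  fi
  printf 'coreutils-%s-on-glibc-2.28.patch\n' "$base"
}

patch_for_version 6.9
patch_for_version 8.20
```

The two calls print coreutils-5.97-on-glibc-2.28.patch and
coreutils-8.18-on-glibc-2.28.patch respectively, matching the ranges in
the table.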


Hope this helps someone.

regards,
 - assaf
diff -r -U3 coreutils-5.0/src/Makefile.in coreutils-5.0-patched/src/Makefile.in
--- coreutils-5.0/src/Makefile.in   2003-04-02 07:46:19.0 -0700
+++ coreutils-5.0-patched/src/Makefile.in   2019-08-01 19:38:07.440997426 -0600
@@ -209,7 +209,7 @@
 printf_LDADD = $(LDADD) @POW_LIB@ @LIBICONV@
 
 # If necessary, add -lm to resolve use of floor, rint, modf.
-seq_LDADD = $(LDADD) @SEQ_LIBM@
+seq_LDADD = $(LDADD) @SEQ_LIBM@ -lm
 
 # If necessary, add -lm to resolve the `pow' reference in lib/strtod.c
 # or for the fesetround reference in programs using nanosec.c.
diff -r -U3 coreutils-5.0/src/tee.c coreutils-5.0-patched/src/tee.c
--- coreutils-5.0/src/tee.c 2002-12-15 07:21:45.0 -0700
+++ coreutils-5.0-patched/src/tee.c 2019-08-01 19:34:32.374301325 -0600
@@ -32,7 +32,7 @@
 
 #define AUTHORS N_ ("Mike Parker, Richard M. Stallman, and David MacKenzie")
 
-static int tee (int nfiles, const char **files);
+static int tee_FOO (int nfiles, const char **files);
 
 /* If nonzero, append to output files rather than truncating them. */
 static int append;
@@ -146,7 +146,7 @@
   /* Do *not* warn if tee is given no file arguments.
  POSIX requires that it work when given no arguments.  */
 
-  errs = tee (argc - optind, (const char **) &argv[optind]);
+  errs = tee_FOO (argc - optind, (const char **) &argv[optind]);
   if (close (STDIN_FILENO) != 0)
 error (EXIT_FAILURE, errno, _("standard input"));
 
@@ -158,7 +158,7 @@
Return 0 if successful, 1 if any errors occur. */
 
 static int
-tee (int nfiles, const char **files)
+tee_FOO (int nfiles, const char **files)
 {
   FILE **descriptors;
   char buffer[BUFSIZ];
diff -r -U3 coreutils-5.0/src/test.c coreutils-5.0-patched/src/test.c
--- coreutils-5.0/src/test.c 2003-02-10 02:19:09.0 -0700
+++ coreutils-5.0-patched/src/test.c 2019-08-01 19:35:52.871307966 -0600
@@ -139,7 +139,7 @@
 /* Do the same thing access(2) does, but use the effective uid and gid.  */
 
 static int
-eaccess (char const *file, int mode)
+eaccess_FOO (char const *file, int mode)
 {
   static int have_ids;
   static uid_t uid, euid;
@@ -635,17 +635,17 @@
 
 case 'r':  /* file is readable? */
   unary_advance ();
-  value = -1 != eaccess (argv[pos - 1], R_OK);
+  value = -1 != eaccess_FOO (argv[pos - 1], R_OK);
   return (TRUE == value);
 
 case 'w':  /* File is writable? */
   unary_advance ();
-  value = -1 != eaccess (argv[pos - 1], W_OK);
+  value = -1 != eaccess_FOO (argv[pos - 1], W_OK);
   return (TRUE == value);
 
 case 'x':  /* File is executable? */
   unary_advance ();
-  value = -1 != eaccess (argv[pos - 1], X_OK);
+  value = -1 != eaccess_FOO (argv[pos - 1], X_OK);
   return (TRUE == value);
 
 case 'O':  /* File is owned by you? */
diff -r -U3 coreutils-6.4/lib/utimens.c coreutils-6.4-patched/lib/utimens.c
--- coreutils-6.4/lib/utimens.c 2006-09-14 03:53:59.0 -0600
+++ 

seq: fix bug of printing extra line

2019-08-01 Thread Assaf Gordon
Hello,

Prompted by the recent 'seq' thread, I spotted a bug in seq.
Fix attached.
I think it does not introduce any regressions, but review and comments
are very welcomed.

-assaf
From 52505fe73fb00a30435009895d03fa3bba1297a4 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Thu, 1 Aug 2019 17:01:21 -0600
Subject: [PATCH] seq: fix superfluous output line

Under certain circumstances seq prints an extra line when a custom
output format has characters following the printed number:

$ seq -f "%g " 1000000 1000000
 1e+06
 1e+06

This is due to the "print_extra_number" logic using strings to determine
whether an 'extra number' is needed, but only one string was trimmed
when using a custom printf format.

Prompted by https://lists.gnu.org/r/coreutils/2019-08/msg1.html

* NEWS: Mention fix.
* src/seq.c (print_numbers): Trim the 'x0_str' string before comparing
it to the previous 'x_str' string.
* tests/misc/seq-extra-number.sh: Add this scenario.
* tests/local.mk (all_tests): Add new test.
---
 NEWS                           |  4 +++
 src/seq.c                      |  4 ++-
 tests/local.mk                 |  1 +
 tests/misc/seq-extra-number.sh | 47 ++
 4 files changed, 55 insertions(+), 1 deletion(-)
 create mode 100755 tests/misc/seq-extra-number.sh

diff --git a/NEWS b/NEWS
index fd0543351..97c9d18bd 100644
--- a/NEWS
+++ b/NEWS
@@ -34,6 +34,10 @@ GNU coreutils NEWS                                    -*- outline -*-
   for --numeric, --hex, or default alphabetic suffixes respectively.
   [bug introduced in coreutils-8.24]
 
+  seq no longer prints an extra line under certain circumstances (such as
+  'seq -f "%g " 1000000 1000000').
+  [bug introduced in coreutils-6.10]
+
 ** New Features
 
   od --skip-bytes now can use lseek even if the input is not a regular
diff --git a/src/seq.c b/src/seq.c
index b5913368a..8efe929e1 100644
--- a/src/seq.c
+++ b/src/seq.c
@@ -340,8 +340,10 @@ print_numbers (char const *fmt, struct layout layout,
   && x_val == last)
 {
   char *x0_str = NULL;
-  if (asprintf (&x0_str, fmt, x0) < 0)
+  int x0_strlen = asprintf (&x0_str, fmt, x0);
+  if (x0_strlen < 0)
 xalloc_die ();
+  x0_str[x0_strlen - layout.suffix_len] = '\0';
   print_extra_number = !STREQ (x0_str, x_str);
   free (x0_str);
 }
diff --git a/tests/local.mk b/tests/local.mk
index e88d99f24..3e347cd96 100644
--- a/tests/local.mk
+++ b/tests/local.mk
@@ -245,6 +245,7 @@ all_tests = \
   tests/misc/test.pl   \
   tests/misc/seq.pl\
   tests/misc/seq-epipe.sh  \
+  tests/misc/seq-extra-number.sh   \
   tests/misc/seq-io-errors.sh  \
   tests/misc/seq-locale.sh \
   tests/misc/seq-long-double.sh\
diff --git a/tests/misc/seq-extra-number.sh b/tests/misc/seq-extra-number.sh
new file mode 100755
index 0..4295e1791
--- /dev/null
+++ b/tests/misc/seq-extra-number.sh
@@ -0,0 +1,47 @@
+#!/bin/sh
+# Test the "print_extra_number" logic seq.c:print_numbers()
+
+# Copyright (C) 2019 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
+print_ver_ seq
+
+##
+## Test 1: the documented reason for the logic
+##
+cat<<'EOF'>exp1 || framework_failure_
+0.00
+0.01
+0.02
+0.03
+EOF
+
+seq 0 0.01 0.03 > out1 || fail=1
+compare exp1 out1 || fail=1
+
+
+##
+## Test 2: before 8.32, this resulted in TWO lines
+## (print_extra_number was erroneously set to true)
+## The '=' is there instead of a space to ease visual inspection,
+cat<<'EOF'>exp2 || framework_failure_
+1e+06=
+EOF
+
+seq -f "%g=" 1000000 1000000 > out2 || fail=1
+compare exp2 out2 || fail=1
+
+Exit $fail
-- 
2.20.1
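
The commit message above attributes the extra line to comparing one
trimmed string against one untrimmed string. The following is a
simplified shell model of just that comparison, not the code in seq.c:
the function names are hypothetical, the shell's printf stands in for
asprintf(3), and the real code additionally checks that the number just
past LAST parses back to a value equal to LAST.

```shell
#!/bin/sh
# Simplified model of the string comparison in seq.c:print_numbers().
# FMT is a printf format that ends in SUFFIX; X0 is the previous number
# and X is the number just past LAST.  Both functions succeed (status 0)
# when an "extra number" would be printed.

# Before the fix: X's string is trimmed, X0's string is not.
extra_number_buggy() {
  fmt=$1 suffix=$2 x0=$3 x=$4
  x_str=$(printf "$fmt" "$x");   x_str=${x_str%"$suffix"}
  x0_str=$(printf "$fmt" "$x0")
  [ "$x0_str" != "$x_str" ]
}

# After the fix: both strings are trimmed before they are compared.
extra_number_fixed() {
  fmt=$1 suffix=$2 x0=$3 x=$4
  x_str=$(printf "$fmt" "$x");   x_str=${x_str%"$suffix"}
  x0_str=$(printf "$fmt" "$x0"); x0_str=${x0_str%"$suffix"}
  [ "$x0_str" != "$x_str" ]
}

# With "%g=", both 1000000 and 1000001 format as "1e+06=".
if extra_number_buggy '%g=' '=' 1000000 1000001; then
  echo "buggy comparison: extra line printed"
fi
if ! extra_number_fixed '%g=' '=' 1000000 1000001; then
  echo "fixed comparison: no extra line"
fi
```

In the buggy variant "1e+06=" is compared with "1e+06", so the strings
always differ whenever the format has a suffix.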



bug#36831: enhance 'directory not empty' message

2019-08-01 Thread Assaf Gordon
Hello,

On Wed, Jul 31, 2019 at 08:03:45PM -0700, Paul Eggert wrote:
> Assaf Gordon wrote:
> > An explicit error explicitly saying "cannot move", and mention the
> > source and destination, and also "blames" the target directory seems
> > the most user-friendly and least ambiguous.
> 
> Sure, but that handles only the ENOTEMPTY/EEXIST case. How would you handle
> the EDQUOT, EISDIR, and ENOSPC cases? Will you invent a separate diagnostic
> for each case, or just treat them as in my proposed patch? I assume the
> latter, but either way I'd like to see a patch that handles these properly
> too. Also, please handle ETXTBUSY while you're at it (sorry, I missed that
> one).
> 
> > For the second and third cases,
> > "No space" and "Quota exceeded" seem to me to always relate to the
> > destination, and I don't think users get confused about those
> > (other opinions of course welcomed).
> 
> What's obvious to experts like us is not always obvious to users. If users
> get confused by the current diagnostic for ENOTEMPTY/EEXIST, I don't see why
> they wouldn't also get confused for ETXTBUSY etc.
> 
> > Your patch also added "EISDIR", for which rename(2) says:
> >  "newpath is an existing directory, but oldpath is not a directory."
> > 
> > But I don't think this error can happen with gnu mv.
> 
> It can, as a result of a race condition if some other process is mutating
> the file system while 'mv' is running. Admittedly unlikely, but we might as
> well improve this errno value while we're improving the others.

All good points.

Please see attached updated version.

It does add an explicit error string for each error code, but I hope the
implementation is reasonable and easy to maintain and translate.

-assaf
From 8ee71b24d74d7cfe81f151de430d38935cf04675 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Mon, 29 Jul 2019 00:23:20 -0600
Subject: [PATCH] mv: improve error messages when target directory is at fault

Suggested by Alex Mantel  in
https://bugs.gnu.org/36831 .

$ mkdir A B B/A
$ touch A/bar B/A/foo

Before:

$ mv A B
mv: cannot move 'A' to 'B/A': Directory not empty

After:

$ mv A B
mv: cannot move 'A' to 'B/A': Target directory not empty

The following errors are handled:
EDQUOT, EEXIST, ENOTEMPTY, EISDIR, ENOSPC, ETXTBSY.

* src/copy.c (copy_internal): Print custom messages for errors
that explicitly fault the target directory.
(strerror_target): New function, return custom and translatable error
messages.
* tests/mv/dir2dir.sh: Adjust expected error message.
* NEWS: Mention change.
---
 NEWS                |  6 +
 src/copy.c          | 56 ++---
 tests/mv/dir2dir.sh |  6 ++---
 3 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/NEWS b/NEWS
index fd0543351..4ec4d0df0 100644
--- a/NEWS
+++ b/NEWS
@@ -44,6 +44,12 @@ GNU coreutils NEWS                                    -*- outline -*-
   stat(1) also supports a new --cached= option to control cache
   coherency of file system attributes, useful on network file systems.
 
+** Improvements
+
+  rm now prints clearer error messages when a failure relates to the
+  target directory (e.g., "Target directory is not empty" instead of
+  "Directory not empty").
+
 
 * Noteworthy changes in release 8.31 (2019-03-10) [stable]
 
diff --git a/src/copy.c b/src/copy.c
index 65cf65895..9cf02ad9c 100644
--- a/src/copy.c
+++ b/src/copy.c
@@ -1867,6 +1867,38 @@ source_is_dst_backup (char const *srcbase, struct stat const *src_st,
   return dst_back_status == 0 && SAME_INODE (*src_st, dst_back_sb);
 }
 
+static char*
+strerror_target (int e)
+{
+  /* TRANSLATORS: These strings should mimick libc's standard
+ error messages (from strerror(3)), but explicitly mention
+ the fault is with the target directory. */
+  switch (errno)
+{
+case EDQUOT:
+  return _("Disk quota exceeded on target device");
+case EEXIST:
+case ENOTEMPTY:
+  return _("Target directory not empty");
+case EISDIR:
+  return _("Tried to overwrite a directory with a file");
+case ENOSPC:
+  return _("No space left on target device");
+case ETXTBSY:
+  /* NOTE: The error is "Text file busy" - but "text" in that context
+ refers to "text segment" of an executable file (as opposed to
+ "data segment" and "BSS segment").
+
+ This error message is meant for users, and 'text file' can be easily
+ confused with an actual text file (i.e., one containing only ASCII
+ characters. Thus, say 'executable' instead of 'text'.*/
+  return _("Target executable file is busy");
+default:
+  assert (0);
+}
+}
+
+
 /* Co

Re: date: new options to parse input date with strptime(3)

2019-08-01 Thread Assaf Gordon
Hello,

Thank you for the review.
(replying to both emails together)

On Wed, Jul 31, 2019 at 04:27:20PM +0100, Stephane Chazelas wrote:
> 2019-07-31 14:59:42 +0100, Pádraig Brady:
> > On 26/07/19 08:29, Assaf Gordon wrote:
> [...]
> > > The first patch adds '--date-format=FORMAT', where FORMAT is
> > > strptime(3) format.
> > 
> > I like this, and think it's useful functionality.
> > It's equivalent to -f in date(1) on FreeBSD,
> > so we should probably support that short option
> [...]
> 
> Note that busybox date has -D for that.

In gnu date(1), -f is already assigned to "--file" (batch processing).
I added the "-D" short option.

> [...] you can use the standard getdate() DATEMSK variable [...]

Based on past coreutils policies, I think new environment variables
won't be accepted to any program...

>> The second patch adds '--arith-format=FORMAT', where FORMAT is limited
>> to years/months/days/hours/minutes/seconds (%Y/%m/%d/%H/%M/%S).
> 
> The idea here is to support more generic numeric deltas.
> I'm not sure of the interface though. Perhaps --delta-format
> would be clearer.

I changed it to "--date-delta-format" (to match -D/--date-format).

Note that there's a difference between this and FreeBSD's -v:
The "--date-delta-format" takes the values from the same date string
(-d), so it also works with "--file" (batch processing).

> Or perhaps we should just support the
> FreeBSD -v option to apply the adjustments, which seems more direct
> and would further improve compat.

I like the FreeBSD -v method, and implemented it as well (in two
patches, to ease review).
The commit messages and tests provide many examples.

---

There could be many adjustments to these features, but I hope that if the
bulk of the code exists, adapting it will be easy.
One option, for example, is to do away with "--date-delta-format",
and accept the "-v" syntax in the "-d" string, so it will work
both from the command line and from a file:

date -D "%F" -d "2019-10-31" -v "+2y -100h"
printf "2019-10-31 +2y -100h" | date -D "%F" -f -

---

The attached patches are:

  tests: add 'date -r/--reference=FILE' test
  tests: add 'date -f/--file' (batch processing) test
  date: add -D/--date-format=FORMAT option
  date: add --date-delta-format=FORMAT option
  date: add -v/--adjust-date=STRING option
  date: expand -v=STR syntax to match FreeBSD

---

Comments and suggestions welcomed,
 - assaf




date-strp-2019-08-01.patch.gz
Description: application/gunzip


Re: How to convert a md5sum back to a timestamp?

2019-08-01 Thread Assaf Gordon

Hello,


On 2019-08-01 12:50 a.m., Stephane Chazelas wrote:
> 2019-07-31 22:36:18 -0500, Peng Yu:
>> Suppose that I know a md5sum that is derived one of the timestamps
>> computed below. Is there a way to quickly derive what the original
>> timestamp is? I could make a database of all the timestamps and their
>> md5sums. But as the total number of entries increases, this solution
>> will not be scalable as the database can be big. Is it there any
>> better solution to this problem?
>>
>> for i in {1..2563200}; do date -d "-$i minutes" +%Y%m%d_%I%M%p; done
> [...]
>
> seq -f '-%g minutes' 2563200 | date -f - +%Y%m%d_%I%M%p
>
> would be an improvement as it would only run one date
> invocation, but you'd still need to run one md5sum for each of
> those lines. coreutils md5sum in itself is not slow, but forking
> a process and loading a command and linking its libraries is,
> that's not a bug in coreutils itself.



"datamash" will calculate md5 on multiple lines in one invocation:

   $ seq -f '-%g minutes' 2563200 \
   | date -f - +%Y%m%d_%I%M%p \
   | datamash md5 1

or to see the time AND the md5 sum, add "--full":

   $ seq -f '-%g minutes' 2563200 \
   | date -f - +%Y%m%d_%I%M%p \
   | datamash --full md5 1

Three notes:
1.
I would recommend using "-%7.0f minutes" format in "seq"
instead of "%g", as the latter will result in a scientific notation
for large values:

   $ seq -f '-%7g minutes' 2563200 | tail -n1
   -2.5632e+06 minutes

   $ seq -f '-%7.0f minutes' 2563200 | tail -n1
   -2563200 minutes

2.
Using "-N minutes" as a date format is relative to the current time.
Are you sure that's the value you want? You'll get different values
every time you run it...
To be more reproducible, consider starting with a known date, e.g.:

   $ date -u  -d "2019-08-01 01:53:22Z +55 minutes" +%Y%m%d_%I%M%p
   20190801_0248AM

or
   $ seq -f "2019-08-01 01:53:22Z +%7.0f minutes" 2563200 \
   | date -u -f - +%Y%m%d_%I%M%p | head
   20190801_0154AM


3.
Using "datamash md5" does not include the newline for the md5
calculation, be careful about this when comparing hashing results.
e.g.:

$ echo 20190731_0848PM | md5sum
deb75bda7f8e95d321897d181cbe2556  -

$ printf "%s\n" 20190731_0848PM | md5sum
deb75bda7f8e95d321897d181cbe2556  -

$ printf "%s" 20190731_0848PM | md5sum
d0bf332197593b7c3f6d7757f7d5754a  -

$ printf "%s" 20190731_0848PM | datamash md5 1
d0bf332197593b7c3f6d7757f7d5754a


---

For reference, on my old desktop it takes:

$ time seq -f "2019-08-01 01:53:22Z +%7.0f minutes" 2563200 \
  | date -u -f - +%Y%m%d_%I%M%p \
  | datamash --full md5 1 | wc -l -c
2563200 125596800

real    0m14.185s
user    0m17.739s
sys     0m0.527s

And results in ~125MB of data - reasonable for an ad-hoc reverse
lookup table for MD5 values.
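
As a rough sketch of such an ad-hoc reverse lookup (without datamash, so
it forks one md5sum per line and is only suitable for small ranges): the
function names build_table and lookup are hypothetical, and GNU seq,
date and md5sum are assumed. Hashes are computed without the trailing
newline, matching what "datamash md5" does per note 3 above.

```shell
#!/bin/sh
# build_table START COUNT: print "MD5 TIMESTAMP" lines for COUNT
# consecutive minutes after START.  Each hash covers the timestamp
# WITHOUT a trailing newline.  START must not contain a '%' character,
# since it becomes part of the seq format string.
build_table() {
  seq -f "$1 +%.0f minutes" "$2" \
    | date -u -f - +%Y%m%d_%I%M%p \
    | while read -r ts; do
        sum=$(printf '%s' "$ts" | md5sum)
        # md5sum prints "HASH  -"; keep only the hash.
        printf '%s %s\n' "${sum%% *}" "$ts"
      done
}

# lookup HASH FILE: print the timestamp whose MD5 is HASH, if present.
lookup() {
  awk -v h="$1" '$1 == h { print $2 }' "$2"
}

table=$(mktemp) || exit 1
build_table "2019-08-01 01:53:22Z" 60 > "$table"

# Look up the hash of a timestamp that falls inside the table's range.
target=$(printf '%s' 20190801_0248AM | md5sum)
lookup "${target%% *}" "$table"
```

For the full 2563200-line range the datamash pipeline shown earlier is
the practical choice; the loop here only illustrates the table layout
and the lookup step.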

If your key space gets larger, you should look into 
https://en.wikipedia.org/wiki/Rainbow_table .


Hope this helps,
 - assaf



bug#36831: enhance 'directory not empty' message

2019-07-31 Thread Assaf Gordon
Hello Paul,

On Mon, Jul 29, 2019 at 06:50:46PM -0500, Paul Eggert wrote:
> On 7/29/19 1:28 AM, Assaf Gordon wrote:
> > +  if (rename_errno == ENOTEMPTY || rename_errno == EEXIST)
> > +{
> > +  error (0, 0, _("cannot move %s to %s: Target directory not empty"),
> > + quoteaf_n (0, src_name), quoteaf_n (1, dst_name));
> 
> Although this is an improvement, it is not general enough, as other errno
> values are relevant only for the destination. Better would be to have a
> special case for errno values that matter only for the destination, and use
> the existing code for errno values where we don't know whether the problem
> is the source or the destination. Something like the attached, say.

> +case EDQUOT: case EEXIST: case EISDIR: case ENOSPC: case ENOTEMPTY:
> +  error (0, rename_errno, "%s", quotearg_colon (dst_name));
> +  break;
> +

Thanks for the review.

At the risk of bikeshedding, I'd like to argue for the prior method.
While it is not general enough, I think it provides a clearer error message.

For example, with the more general implementation the errors would be:

  $ mv A B
  mv: B/A: Directory not empty

  $ mv A B
  mv: B/A: No space left on device

  $ mv A B
  mv: B/A: Quota exceeded

In the first case,
I think this error is potentially more confusing than
before: while it doesn't mention the source directory, it also doesn't
say "cannot move" - so it is only implied it is an error (an
inexperienced user might dismiss this as a warning).

Also, there could be a source directory named very similarly to the
destination directory, and from a quick glance it would not be easy to
understand what happened.

An explicit error that says "cannot move", mentions the source and
destination, and also "blames" the target directory seems the most
user-friendly and least ambiguous.

---

For the second and third cases,
"No space" and "Quota exceeded" seem to me to always relate to the
destination, and I don't think users get confused about those
(other opinions of course welcomed).

---

Your patch also added "EISDIR", for which rename(2) says:
 "newpath is an existing directory, but oldpath is not a directory."

But I don't think this error can happen with gnu mv.
If we try to move a file onto a directory, we get:

  $ mkdir C C/D ; touch D
  $ mv D C
  mv: cannot overwrite directory 'C/D' with non-directory

And this case is specifically handled in copy.c line 2131, before
calling rename(2)  (and also this is an example of a custom error
message instead of using stock libc messages).

---

Happy to hear your opinion,
 - assaf







bug#36831: enhance 'directory not empty' message

2019-07-29 Thread Assaf Gordon
Hello,

On Sun, Jul 28, 2019 at 08:58:59PM +0200, Alex Mantel wrote:
[...] 
> Ah, the target directory does exist! Hmm... But i'd like the message to be
> like:
> 
>    $ mv thing/ ../things
>    mv: cannot move 'thing' to '../things/things': Targetdirectory not empty
> 
>   ^ this little thing here,
>     it explains everyting.
> 
> Change text from 'Directory not empty' to 'Targetdirectory not empty'.

Thanks for the report.

To clarify, the scenario is:

$ mkdir A B B/A
$ touch A/bar B/A/foo
$ mv A B
mv: cannot move 'A' to 'B/A': Directory not empty

And the reason (as you've found out) is that the target directory 'B/A'
is not empty (has the 'foo' file in it).
Had this been allowed, moving 'A' to 'B/A' would result in the 'foo'
file disappearing.
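
The scenario can be reproduced in a scratch directory. The following is a
minimal sketch, assuming a POSIX shell with 'mktemp -d'; the function name
'demo_mv_nonempty' is made up for illustration. It checks that the failed
'mv' leaves both 'bar' and 'foo' in place.

```shell
# Hypothetical helper: reproduce the failing 'mv A B' in a throw-away directory.
demo_mv_nonempty()
{
  tmp=$(mktemp -d) || return 1
  (
    cd "$tmp" || exit 1
    mkdir A B B/A
    touch A/bar B/A/foo
    # rename(2) refuses to replace the non-empty directory B/A, so mv fails.
    if mv A B 2>/dev/null; then
      echo "mv unexpectedly succeeded"
    elif [ -e A/bar ] && [ -e B/A/foo ]; then
      echo "mv failed, both files intact"
    else
      echo "mv failed, files changed"
    fi
  )
  rm -rf "$tmp"
}

demo_mv_nonempty
```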

---

How is a user expected to know this error is about the target
directory?

There is a bit of a trade-off here between user-friendliness (especially
for non-technical users) and more technical knowledge.
If we go one step 'lower' to the programming interface, almost all
sources mention this is about the 'target' directory not being empty:

POSIX says:
https://pubs.opengroup.org/onlinepubs/009695399/functions/rename.html
[EEXIST] or [ENOTEMPTY]
The link named by new is a directory that is not an empty directory.

Linux's rename(2) manual page says:
ENOTEMPTY or EEXIST
newpath is a nonempty directory, that is, contains entries
other than "." and "..".

FreeBSD's rename(2) manual page says:
[ENOTEMPTY]The to argument is a directory and is not empty.

AIX rename(2) manual page says:
 ENOTEMPTY
   The ToPath parameter specifies an existing directory that is
   not empty.


So there is some merit in claiming this helpful piece of information is
lost when the error message is reported to the user.

---

In GNU coreutils this error message originates from 'copy.c' line 2480:
https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/copy.c#n2480

error (0, rename_errno,
  _("cannot move %s to %s"),
  quoteaf_n (0, src_name), quoteaf_n (1, dst_name));

And herein lies the (technical) problem: The actual message "Directory
not empty" is not in the source code - it is a system error message
that corresponds to the value of 'rename_errno' variable
(ENOTEMPTY/EEXIST). It originates from GLibc (or another libc).

So there is no trivial way to change the error message in coreutils.

Attached a patch to add special handling for this error.

---

What do others think? If this is a desired improvement, I'll finish the
patch with news/tests/etc.


regards,
 - assaf
From 430b30104234db719bf15e6fc681a62312c7124f Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Mon, 29 Jul 2019 00:23:20 -0600
Subject: [PATCH] mv: improve ENOTEMPTY/EEXIST error message

Suggested by Alex Mantel  in
https://bugs.gnu.org/36831 .

$ mkdir A B B/A
$ touch A/bar B/A/foo

Before:

$ mv A B
mv: cannot move 'A' to 'B/A': Directory not empty

After:

$ mv A B
mv: cannot move 'A' to 'B/A': Target directory not empty

* src/copy.c (copy_internal): Add special handling for ENOTEMPTY/EEXIST.
TODO: NEWS, tests.
---
 src/copy.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/src/copy.c b/src/copy.c
index 65cf65895..a5af570bf 100644
--- a/src/copy.c
+++ b/src/copy.c
@@ -2450,6 +2450,14 @@ copy_internal (char const *src_name, char const *dst_name,
   return true;
 }
 
+  if (rename_errno == ENOTEMPTY || rename_errno == EEXIST)
+{
+  error (0, 0, _("cannot move %s to %s: Target directory not empty"),
+ quoteaf_n (0, src_name), quoteaf_n (1, dst_name));
+  forget_created (src_sb.st_ino, src_sb.st_dev);
+  return false;
+}
+
   /* WARNING: there probably exist systems for which an inter-device
  rename fails with a value of errno not handled here.
  If/as those are reported, add them to the condition below.
-- 
2.11.0



date: new options to parse input date with strptime(3)

2019-07-26 Thread Assaf Gordon
Hello,

Some time ago there was a discussion relating to the difficulties of using
GNU date's parsing. There was a mention of how using strptime(3) makes
parsing explicit and easy.

I like that idea, and decided to try my hand at adding such options.

Attached is a proof of concept.

The first patch adds '--date-format=FORMAT', where FORMAT is
strptime(3) format.
The second patch adds '--arith-format=FORMAT', where FORMAT is limited
to years/months/days/hours/minutes/seconds (%Y/%m/%d/%H/%M/%S).

Examples:

  # Specific date
  $ ./src/date --date-format '%d %b %Y' --date '17 Feb 1979' +%F
  1979-02-17

  # The 100th day of 2019
  $ ./src/date --date-format '%Y %j' --date '2019 100' +%F
  2019-04-10

  # Tuesday of the 10th week in 2018
  $ ./src/date --date-format '%Y %W %A' --date '2018 10 Tue' +%F
  2018-03-06

  # 2019-07-26 18:49:59, +49 hours, -10 minutes, -30 seconds:
  $ date --date-format '%Y%m%d %H%M%S' \
 --arith-format '%H %M %S' \
 --date '20190726 184959 49 -10 -30' \
'+%F %T'
  2019-07-28 19:39:29

The test file (date-strp.pl) contains more usage examples.

This is just a proof of concept, and of course many things can be
improved and changed (assuming this feature is desired).

Comments and suggestions very welcomed,
 - assaf
From 82c8b42de7bf9c69432ff175838f01f10008a512 Mon Sep 17 00:00:00 2001
From: Assaf Gordon 
Date: Thu, 25 Jul 2019 02:35:46 -0600
Subject: [PATCH 1/2] date: add --date-format=FORMAT option

Parse -d=STRING dates using strptime(3) instead of gnulib's
parse_datetime.c heuristics.

Example: print the 100th day of 2019:

  $ date --date-format '%Y %j' --date '2019 100' +%F
  2019-04-10

TODO: coreutils.texi, NEWS, usage

* src/date.c (long_options): Add --date-format/STRP_FORMAT option.
(parse_datetime_flags): Replace with ...
(debug): ... new variable.
(strp_format): New variable to hold the user-specified FORMAT string.
(parse_datetime_string): New function, wrapper for
parse_datetime2/strptime.
(batch_convert, main): Call parse_datetime_string instead of
parse_datetime2.
(main): Handle STRP_FORMAT option.
* tests/misc/date-strp.pl: New tests.
* tests/local.mk (TESTS): Add date-strp.pl
---
 src/date.c  |  78 ++---
 tests/local.mk  |   1 +
 tests/misc/date-strp.pl | 151 
 3 files changed, 221 insertions(+), 9 deletions(-)
 create mode 100644 tests/misc/date-strp.pl

diff --git a/src/date.c b/src/date.c
index d97d0ae52..4879474e3 100644
--- a/src/date.c
+++ b/src/date.c
@@ -80,7 +80,8 @@ static char const rfc_email_format[] = "%a, %d %b %Y %H:%M:%S %z";
 enum
 {
   RFC_3339_OPTION = CHAR_MAX + 1,
-  DEBUG_DATE_PARSING
+  DEBUG_DATE_PARSING,
+  STRP_FORMAT
 };
 
 static char const short_options[] = "d:f:I::r:Rs:u";
@@ -97,6 +98,7 @@ static struct option const long_options[] =
   {"rfc-2822", no_argument, NULL, 'R'},
   {"rfc-3339", required_argument, NULL, RFC_3339_OPTION},
   {"set", required_argument, NULL, 's'},
+  {"date-format", required_argument, NULL, STRP_FORMAT},
   {"uct", no_argument, NULL, 'u'},
   {"utc", no_argument, NULL, 'u'},
   {"universal", no_argument, NULL, 'u'},
@@ -105,8 +107,11 @@ static struct option const long_options[] =
   {NULL, 0, NULL, 0}
 };
 
-/* flags for parse_datetime2 */
-static unsigned int parse_datetime_flags;
+static bool debug ;
+
+/* the strp format string specified by the user */
+static char* strp_format;
+
 
 #if LOCALTIME_CACHE
 # define TZSET tzset ()
@@ -142,6 +147,9 @@ Display the current time in the given FORMAT, or set the system date.\n\
   -d, --date=STRING  display time described by STRING, not 'now'\n\
 "), stdout);
   fputs (_("\
+  --date-format=FORMAT   parse -d,-f values according to FORMAT\n\
+"), stdout);
+  fputs (_("\
   --debugannotate the parsed date,\n\
   and warn about questionable usage to stderr\n\
 "), stdout);
@@ -281,6 +289,57 @@ Show the local time for 9AM next Friday on the west coast of the US\n\
   exit (status);
 }
 
+/* A wrapper calling either gnulib's parse_datetime2() or strptime(3),
+   depending on whether the user specified --date-format=FORMAT argument.  */
+static bool
+parse_datetime_string (struct timespec *result, char const *datestr,
+   timezone_t tzdefault, char const *tzstring)
+{
+  if (strp_format)
+{
+  struct tm t;
+  time_t s = time (NULL);
+  localtime_rz (tzdefault, &s, &t);
+  char *endp = strptime (datestr, strp_format, &t);
+  if (!endp)
+{
+  if (debug)
+error (0, 0, _("date string %s does not match format '%s'"),
+   quotearg (datestr),
+   strp_format);
+  return false;
+}
+
+  if (*endp)
+{
+  if (debug)
+error (EXIT_FAIL

Re: doc: add "version sort" chapter

2019-07-23 Thread Assaf Gordon

On 2019-07-22 11:56 p.m., Bernhard Voelker wrote:

On 7/15/19 9:32 PM, Assaf Gordon wrote:

[...] pushed [...]


'make check' fails for sc-avoid-builtin, and I propose some other
fixes in the attached as well.
WDYT?



These all look good, thanks for the improvements!





Re: doc: add "version sort" chapter

2019-07-15 Thread Assaf Gordon

On 10/07/19 19:57, Assaf Gordon wrote:


I would like to suggest adding a new chapter to the manual,
detailing the nitty-gritties of "version sort" in coreutils.



Attached the updated version,
including improvements Bernhard sent off-list.

Comments welcomed,


With no further comments, pushed here:

https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=3264d4ca0d4fd5477d2232c3e097422efdd669ec

-assaf





bug#36674: Sort Suggestion

2019-07-15 Thread Assaf Gordon
tag 36674 notabug
close 36674
stop

Hello,

On Mon, Jul 15, 2019 at 11:42:01AM -0700, Marshall Lake wrote:
> Even though this isn't a bug, I was asked to send the following to this
> email address.

(General suggestions and discussions are better suited for
coreut...@gnu.org mailing list, that way the system won't open a new
bug item.)

> 
> Re:  SORT Command from GNU coreutils 8.25
> 
> A suggestion for an additional option to the SORT command is to ignore
> non-alphanumeric characters.
> 
> As an example, in attempting to sort an index ...
> 
> Abbott, William259
> 
> sorts before:
> 
> Abbot, William 099
> 
> If non-alphanumeric characters were ignored then the same two records
> would sort as:
> 
> Abbot, William 099
> Abbott, William259
> 
> 

There's actually something else at play here:
In your case, sort does ignore non-alphanumeric characters,
but it ALSO ignores white space.
That happens because your locale is set to some language
(for example, en_US.UTF8).

Using such a locale makes sort ignore all non-alphanumeric characters,
whitespace, and upper/lower case.

In essence, you are comparing "AbbottWilliam" (two 't's) to
'AbbotWilliam' (one 't') - and then the second 't' is compared to a 'w',
and is determined to come first.

If you force a POSIX/C locale, then all characters are considered,
and the result will be as you requested.

Observe the following:

  $ printf "%s\n" AbbottWilliam AbbotWilliam | LC_ALL=en_CA.utf8 sort
  AbbottWilliam
  AbbotWilliam

  $ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=en_CA.utf8 sort
  Abbott William
  Abbot William

  $ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=C sort
  Abbot William
  Abbott William

  $ printf "%s\n" "Abbott, William" "Abbot, William" | LC_ALL=C sort
  Abbot, William
  Abbott, William

Note that 'sort' already has an option for dictionary style sorting:
   -d, --dictionary-order: consider only blanks and alphanumeric characters.

However, locale rules take precedence over it, so effectively it only
works in "C" locale:

  $ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort
  Ab,,b,,ott William
  Abbot William

  $ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort -d
  Abbot William
  Ab,,b,,ott William


You can read past discussion about the confusion resulting from locale
sorting rules here:
   https://debbugs.gnu.org/11621
   https://debbugs.gnu.org/12783


As such, I'm closing this as "not a bug", but discussion can continue
by replying to this thread.

-assaf






bug#36671: tail: unrecognized file system type 0x794c7630 for ‘/var/log/messages’. please report this to bug-coreutils@gnu.org. reverting to polling

2019-07-15 Thread Assaf Gordon
tag 36671 notabug
close 36671
stop

Hello,

On Mon, Jul 15, 2019 at 06:22:47PM +0200, John Koppolu wrote:
> tail: unrecognized file system type 0x794c7630 for ‘/var/log/messages’.
> please report this to bug-coreutils@gnu.org. reverting to polling

You've previously reported this 4 days ago,
please see the reply there:
  https://bugs.gnu.org/36600#8

-assaf






Re: doc: add "version sort" chapter

2019-07-11 Thread Assaf Gordon
Hello,

On Thu, Jul 11, 2019 at 03:36:19PM +0100, Pádraig Brady wrote:
> On 10/07/19 19:57, Assaf Gordon wrote:
> > 
> > I would like to suggest adding a new chapter to the manual,
> > detailing the nitty-gritties of "version sort" in coreutils.
> > 
> A few adjustments attached.
> 

Thanks.

Attached the updated version,
including improvements Bernhard sent off-list.

Comments welcomed,
 - assaf


0001-doc-add-version-sort-ordering-chapter.patch.gz
Description: application/gunzip


bug#36600: unrecognized file system type 0x794c7630 for ‘/var/log/messages’. please report this to bug-coreutils@gnu.org. reverting to polling

2019-07-11 Thread Assaf Gordon
tag 36600 notabug
close 36600
stop

Hello,

On Thu, Jul 11, 2019 at 05:53:16PM +0200, John Koppolu wrote:
> unrecognized file system type 0x794c7630 for ‘/var/log/messages’. please
> report this to bug-coreutils@gnu.org. reverting to polling
>

This file system (overlayfs, commonly used with Docker containers) was
added in version 8.25. Consider upgrading coreutils if possible.

See https://www.gnu.org/software/coreutils/filesystems.html for
more details.

regards,
 - assaf





Re: How to print sizes of both files and directories in a directory?

2019-07-01 Thread Assaf Gordon

Hello,

On 2019-07-01 12:10 a.m., Peng Yu wrote:

`du -h --max-depth=1` only print directory sizes. Is there a way to
print the sizes of both directories and files in a directory?


  du -h --max-depth=1 --all

as mentioned in the --help screen:

-d, --max-depth=N print the total for a directory (or file, with
  --all) only if it is N or fewer levels below
  the command line argument;
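
To see the effect of --all, here is a minimal sketch, assuming GNU du (the
long options are not available in every du implementation); the helper name
'demo_du_all' and the file names are made up for illustration.

```shell
# Hypothetical helper: list sizes of files and directories one level deep.
demo_du_all()
{
  tmp=$(mktemp -d) || return 1
  mkdir "$tmp/sub"
  echo hello > "$tmp/sub/inner.txt"
  echo world > "$tmp/top.txt"
  # --all includes regular files; --max-depth=1 stops below the first level,
  # so sub/inner.txt is counted in 'sub' but not listed on its own.
  (cd "$tmp" && du -h --max-depth=1 --all . | sort -k2,2)
  rm -rf "$tmp"
}

demo_du_all
```

The output should contain one line each for '.', './sub' and './top.txt';
the exact sizes depend on the file system.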


regards,
 - assaf




Re: How to sort and count efficiently?

2019-06-30 Thread Assaf Gordon



On 2019-06-30 11:10 a.m., Peng Yu wrote:
The problem with this kind of awk program is that everything will be 
loaded to memory.


Well, those are the two main options: store in memory or resort to disk
I/O. Each has its own pros and cons.


But bare `sort` use external files to save memory.

Not exactly - the goal is not to "save" memory -
sort resorts to external files to be able to complete the
sort even when it runs out of the (allotted) memory (which can be
controlled with the "-S" parameter).
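
For reference, the sort-then-count approach with an explicit buffer size can
be sketched as follows. This assumes GNU sort for "-S"; the helper name
'count_lines' and the 16M figure are arbitrary choices for illustration, and
the final awk step assumes the strings contain no whitespace.

```shell
# Hypothetical helper: count duplicate lines with a bounded sort buffer.
count_lines()
{
  # -S caps sort's in-memory buffer (GNU sort); larger inputs spill to
  # temporary files. uniq -c then counts adjacent duplicates, and awk
  # swaps the columns to print "string count".
  LC_ALL=C sort -S 16M | uniq -c | awk '{ print $2, $1 }'
}

printf '%s\n' a c b b b b b b c | count_lines
```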

I'm not familiar with a program which implements hashing backed by file-
storage, but perhaps such program exists.

When the hash in awk is too large, accessing it can become very slow 
(maybe due to potential cache miss or slow down of hash as a function of 
hash size).


Nothing is "free", and using a hash incurs its own costs.

If you're using the simplified awk hashing program,
try to use other AWK implementations than GNU awk (e.g. I have had
some performance gains from switching to "mawk", the default awk in
Debian).

  $ printf "%s\n" a c b b b b b b c \
 | mawk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }'
  a 1
  b 6
  c 2

Or, if your input is exceedingly large, perhaps consider pre-processing
it and splitting the input into smaller files - each one will have fewer
strings and hashing them will consume less memory.
The following example splits the input file into 27 files, based on
the first letter of the string (and an "other" file for non-letters):

 mawk '{ l = tolower(substr($0,1,1)) ;
 if (l>="a" && l<="z") {
   print $0 > l
 } else {
   print $0 > "other"
 }
   }' INPUT

This is an O(N) operation that doesn't consume any memory (just lots of
disk I/O) - and the resulting files will be much smaller - they can then
be hashed with less memory.
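
Putting the two steps together, a rough sketch could look like the
following. The helper name 'count_by_parts' and the 'part_' file prefix are
made up for illustration, and it assumes a POSIX awk plus 'mktemp -d'.

```shell
# Hypothetical helper: split INPUT by first letter, then count each part.
count_by_parts()
{
  input=$1
  tmp=$(mktemp -d) || return 1
  # Step 1: one pass over the input, writing each line to a per-letter file.
  LC_ALL=C awk -v dir="$tmp" '{
      l = tolower(substr($0, 1, 1))
      if (l >= "a" && l <= "z") out = dir "/part_" l
      else                      out = dir "/part_other"
      print $0 > out
    }' "$input"
  # Step 2: hash each (smaller) part separately; only one part's strings
  # are held in memory at a time.
  for f in "$tmp"/part_*; do
    [ -e "$f" ] || continue
    awk '{ a[$0]++ } END { for (i in a) print i, a[i] }' "$f"
  done | LC_ALL=C sort -k1,1
  rm -rf "$tmp"
}
```

Usage would be 'count_by_parts INPUT'. Since every occurrence of a given
string lands in the same part, the per-part counts are already the final
counts and need no merging.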


Of course this can be extended to split into smaller-grained files.

-assaf



Re: How to sort and count efficiently?

2019-06-30 Thread Assaf Gordon
Correcting myself:

On Sun, Jun 30, 2019 at 10:08:46AM -0600, Assaf Gordon wrote:
> On Sun, Jun 30, 2019 at 07:34:19AM -0500, Peng Yu wrote:
> > 
> > I have a long list of string (each string is in a line). I need to
> > count the number of appearance for each string.
> > 
> > [...] Does anybody know any better way
> > to make the sort and count run more efficiently?
> > 
> 
> Or using gnu awk:

use 'asorti' instead of 'asort', with the two-parameter variant:


  $ printf "%s\n" a c b b b b b b c \
| awk 'a[$1]++ {}
   END { n = asorti(a,b)
 for (i = 1; i <= n; i++) {
print b[i], a[b[i]]
 }
   }'
  a 1
  b 6
  c 2


For more details see:
https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html#Array-Sorting-Functions

-assaf




Re: How to sort and count efficiently?

2019-06-30 Thread Assaf Gordon
Hello,

On Sun, Jun 30, 2019 at 07:34:19AM -0500, Peng Yu wrote:
> Hi,
> 
> I have a long list of string (each string is in a line). I need to
> count the number of appearance for each string.
> 
> I currently use `sort` to sort the list and then use another program
> to do the count. The second program doing the count needs only a small
> amount of the memory as the input is sorted.
> 
> [...] Does anybody know any better way
> to make the sort and count run more efficiently?
> 

Using awk:

  awk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }' INPUT \
| sort -k1,1

Or using gnu awk:

  awk 'a[$1]++ {}
   END { n = asort(a) ;
 for (i = 1; i <= n; i++) {
 print i, a[i]
 }
   }'


regards,
 - assaf



bug#35939: version sort is incorrect with hyphen-minus

2019-06-26 Thread Assaf Gordon
Hello Paul,

On Wed, Jun 26, 2019 at 12:57:14PM -0700, Paul Eggert wrote:
> GNU sort uses the same algorithm as glibc strverscmp,

I think that both sort and ls use 'filevercmp' - a simplified version
that does not support locales (and doesn't fail).

The change (from 'strverscmp') was made in:

  commit e505736f8211a608b00dfe75fb186a5211e1a183
  Author: Kamil Dudka 
  Date:   Fri Oct 3 11:03:40 2008 +0200
  ls and sort: use filevercmp instead of strverscmp
  
https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=e505736f8211a608b00dfe75fb186a5211e1a183

> Has the Debian version-comparison algorithm changed since 1997? If so, could
> you give details about the changes to the Debian algorithm?

I don't think the algorithm changed in Debian,
and also in gnulib there are only a handful of relevant commits, all 10
years old:

  9121662f1 2008-10-03 filevercmp: new module
  0443c2f39 2009-03-05 filevercmp: Move hidden files up in ordering.
  1721cf06d 2009-03-24 filevercmp: handle simple~ and numbered.~3~ backup suffixes
  4fd008794 2009-04-09 filevercmp: fix regression
  cc96df30d 2009-04-09 filevercmp: correct today's change

I think (also based on Ian's confirmation) that this discrepancy was
from the beginning.

I now notice that there's an additional difference: coreutils/gnulib has
special handling for extensions, hidden files and backup files.

As Ian wrote, a documentation improvement is probably the best fix.
I'll try to come up with a suggested change.

-assaf

P.S.

For completeness, here are a few other threads with details/explanations
about 'version-sort':
https://bugs.gnu.org/18168
https://bugs.gnu.org/22275
https://bugs.gnu.org/22455
https://bugs.gnu.org/33786





bug#35939: version sort is incorrect with hyphen-minus

2019-06-26 Thread Assaf Gordon
(Adding Ian Jackson for dpkg/debian-version details)

Hello,

On Tue, May 28, 2019 at 02:53:39AM +0200, Vincent Lefevre wrote:
> With GNU coreutils 8.30 under Debian/unstable, I get:
> 
> $ LC_ALL=C ls
> ab-cd  abb  abe
> $ LC_ALL=C ls -v
> abb  abe  ab-cd
> 
> The hyphen-minus character should still be regarded as being less
> than the letters (there are no digits, so both are expected to be
> equivalent). The GNU coreutils manual says:
> 
[...]

Thanks for the report and the clear details.

To summarize,
"ls -v" and "sort -V" (coreutils' version sort) behave differently from
other implementations with regard to the minus character:

$ printf "%s\n" abb ab-cd | sort -V
abb
ab-cd

$ v1="abb"
$ v2="ab-cd"
$ dpkg --compare-versions "$v1" lt "$v2" && printf "$v1\n$v2\n" || printf "$v2\n$v1\n"
ab-cd
abb

If I understand correctly,
the reason is that in Debian's version comparison algorithm [1], the minus
character has a special meaning: it separates the "upstream version"
part from the "debian revision" part.

In Debian's implementation [2], a version string is first split into three
parts (epoch, upstream version, debian revision) using ":" for epoch
delimiter and "-" for revision delimiter. Only then the three parts are
compared, separately [3].

[1] https://www.debian.org/doc/debian-policy/ch-controlfields.html#version
[2] https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/parsehelp.c#n191
[3] https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/version.c#n140

On the other hand, coreutils' implementation (from gnulib [4]) does not
break the version string into three parts - it treats the entire string as
a single "upstream version" part.
The rules for sorting the "upstream version" string say:

  "... The lexical comparison is a comparison of ASCII values modified so
  that all the letters sort earlier than all the non-letters and so that a
  tilde sorts before anything" (from [1])

[4] https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/filevercmp.c

Therefore, dpkg first separates "ab" from "cd", then compares "ab" to
"abb" - and 'ab' comes first;
coreutils compares "ab-cd" to "abb" (or technically, just "ab-" to
"abb"), and because "letters sort earlier than all non-letters", "abb"
comes first.
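
The "letters before non-letters" rule is easiest to see next to plain
byte-order sorting. A small sketch, assuming GNU sort (the function names
are made up for illustration):

```shell
# Byte order (C locale): '-' (0x2D) sorts before 'b' (0x62).
byte_order()    { printf '%s\n' abb ab-cd | LC_ALL=C sort; }

# Version order: letters sort before non-letters, so "abb" comes first.
version_order() { printf '%s\n' abb ab-cd | LC_ALL=C sort -V; }

byte_order
version_order
```

With the coreutils behaviour described above, byte_order lists "ab-cd"
first while version_order lists "abb" first.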

I hope this helps explain the differences (I also hope this explanation is
correct, and I invite others to chime in).


regards,
 - assaf






bug#35654: We've found a vulnerability of gnu chown, please check it and request a cve id for us.

2019-06-26 Thread Assaf Gordon
tag 35654
close 35654
stop

Hello,

On Thu, May 09, 2019 at 11:53:11PM +0800, st0n3 ss wrote:
> Hello! we have found a vulnerability of command chown, please check it.If
> it is a vulnerability. please request a cve id for use, thank you!chown -h
> bypass

Given Paul's and Bob's detailed answers, I'm closing this as "not a bug".

Discussion can continue by replying to this thread.

regards,
 - assaf





bug#36130: split bug

2019-06-26 Thread Assaf Gordon
tag 36130 notabug
close 36130
stop

Hello,

On Mon, Jun 10, 2019 at 04:50:20PM -0600, Assaf Gordon wrote:
> On 2019-06-10 12:28 p.m., Heather Wick wrote:
> > Verbose: This seems to have made the same number of files this time; not
> > sure why the other 3-4 times I ran it it did not. They appear to be the
> > same size, with paired last reads
> [...]
> 
> Glad to hear it worked.
> 
> Could it be that in previous times the queued job ran out of disk space?
> 
> That would be my first guess, as such things are common in shared
> grid/cluster environments, particularly if your job runs in a temporary
> and limited storage location (e.g. "/tmp/job-").


With no further comments, I'm closing this ticket.
If more issues arise (or this was not an adequate solution) we can always
re-open this ticket.

regards,
 -assaf





bug#35632: date Parse of '13:00 + 2 hours' Broken.

2019-06-26 Thread Assaf Gordon
tag 35632 notabug
close 35632
stop

Hello,

(sorry for the delayed reply)

On Wed, May 08, 2019 at 12:57:10PM +0100, Ralph Corderoy wrote:
> 
> Using date from coreutils 8.31-1 on Arch Linux.
> This surprised me.
> 
> $ TZ=UTC0 /bin/date -d '1pm + 2 hours'
> Wed  8 May 15:00:00 UTC 2019
> $ TZ=UTC0 /bin/date -d '13:00 + 2 hours'
> Wed  8 May 12:00:00 UTC 2019
> 
> The documentation doesn't suggest `1pm' and `13:00' are treated
> differently.  `--debug' helps.
> 
> $ TZ=UTC0 /bin/date --debug -d '1pm + 2 hours'
> date: parsed time part: 01:00:00pm
> date: parsed relative part: +2 hour(s)
> ...
> $ TZ=UTC0 /bin/date --debug -d '13:00 + 2 hours'
> date: parsed time part: 13:00:00 UTC+02
> date: parsed relative part: +1 hour(s)
> date: input timezone: parsed date/time string (+02)
> ...
> 
> It looks like parsing is broken in the second case.

Thank you for providing detailed output with "--debug",
it makes things easier to troubleshoot.

When encountering a time string (HH:MM or HH:MM:SS) followed by a plus
sign and a number, date's parser *always* treats it as a timezone
(giving timezones higher priority than time adjustments).


> The result I wanted can also be obtained my omitting the `+'.
> 
> $ TZ=UTC0 /bin/date -d '1pm 2 hours'
> Wed  8 May 15:00:00 UTC 2019
> $ TZ=UTC0 /bin/date -d '13:00 2 hours'
> Wed  8 May 15:00:00 UTC 2019

And this is indeed one possible solution.

Other similar issues are detailed here:
https://lists.gnu.org/archive/html/bug-coreutils/2018-10/msg00126.html

As such, I'm closing this ticket, but discussion can continue by
replying to this thread.

regards,
 - assaf






bug#36383: date command processes timezone differently when doing math

2019-06-26 Thread Assaf Gordon
tag 36383 notabug
close 36383
stop

Hello,

On Tue, Jun 25, 2019 at 04:10:07PM -0700, Brian Woods wrote:
> When doing a math operation to a date command it appear to process the
> timezone differently.
[...]
>
> #echo $datNow
> 2019-06-25 15:21:34
>
> #date -d "$datNow + 1 minute" "+%Y-%m-%d %H:%M:%S" --debug
> date: parsed date part: (Y-M-D) 2019-06-25
> date: parsed time part: 15:21:34 UTC+01
> date: parsed relative part: +1 minutes
> date: input timezone: parsed date/time string (+01)

Thank you for providing detailed examples with "--debug",
makes things much easier to troubleshoot.

The issue is that a time string (HH:MM:SS) followed by a plus
sign and a number is *always* taken to be a time zone.

Using a value other than 1 will show it more clearly:

  $ date -d "$datNow + 8 minutes" "+%Y-%m-%d %H:%M:%S" --debug
  date: parsed date part: (Y-M-D) 2019-06-25
  date: parsed time part: 15:21:34 UTC+08
  date: parsed relative part: +1 minutes
  date: input timezone: parsed date/time string (+08)

The "+8" part is treated as timezone,
and the remaining text ("minutes") is taken as a one-minute time
adjustment.

One solution is to just remove the plus sign:

  $ date -d "$datNow 8 minutes" "+%Y-%m-%d %H:%M:%S" --debug
  date: parsed date part: (Y-M-D) 2019-06-25
  date: parsed time part: 15:21:34
  date: parsed relative part: +8 minutes
  date: input timezone: system default
  [...]
  2019-06-25 15:29:34

Another is to specify the time zone:

  $ date -d "$datNow +00:00 +8 minutes" "+%Y-%m-%d %H:%M:%S" --debug
  date: parsed date part: (Y-M-D) 2019-06-25
  date: parsed time part: 15:21:34 UTC+00
  date: parsed relative part: +8 minutes
  date: input timezone: parsed date/time string (+00)
  [...]
  2019-06-25 09:29:34
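
For scripts, the first work-around can be wrapped in a small function. This
is only a sketch, assuming GNU date; the name 'add_minutes' is made up, and
TZ=UTC0 is set just to make the result independent of the system time zone.

```shell
# Hypothetical helper: add N minutes to a "YYYY-MM-DD HH:MM:SS" timestamp.
add_minutes()
{
  # No '+' before the number, so GNU date cannot read it as a UTC offset.
  TZ=UTC0 date -d "$1 $2 minutes" '+%Y-%m-%d %H:%M:%S'
}

add_minutes '2019-06-25 15:21:34' 8
```

This is meant for non-negative counts only: a leading minus sign would
presumably run into the same time-zone ambiguity as the plus sign.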


More examples of adjusting time strings are here (your example is similar
to case #1):
https://lists.gnu.org/archive/html/bug-coreutils/2018-10/msg00126.html

As such, I'm closing this ticket but discussion can continue by replying
to this thread.

regards,
 - assaf





Re: About cc and dd

2019-06-23 Thread Assaf Gordon

Hello,

On 2019-06-23 9:06 a.m., altear wrote:


If you don't mind, can you tell me how cp and dd works, just in summary, or 
maybe can tell me from which line thats code work?



A nice code overview is available here:
  http://www.maizure.org/projects/decoded-gnu-coreutils/cp.html
  http://www.maizure.org/projects/decoded-gnu-coreutils/dd.html

And code exploration (using OpenGrok) here:
  https://opengrok.housegordon.com/source//xref/coreutils/src/dd.c
  https://opengrok.housegordon.com/source//xref/coreutils/src/cp.c


-assaf






Re: [musl] Re: date-debug test failure with musl

2019-06-12 Thread Assaf Gordon

Hello,

On 2019-05-16 11:52 a.m., Niklas Hambüchen wrote:


will you submit your patch for inclusion, given that it works well?


pushed here:
https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=0251229bfd9617e8a35cf9dd7d338d63fff74a0c

-assaf



bug#36130: split bug

2019-06-10 Thread Assaf Gordon

Hello,

On 2019-06-10 12:28 p.m., Heather Wick wrote:
Thank you so much for your response. Here are the results of the tests 
you sent:


Verbose: This seems to have made the same number of files this time; not 
sure why the other 3-4 times I ran it it did not. They appear to be the 
same size, with paired last reads

[...]

Glad to hear it worked.

Could it be that in previous times the queued job ran out of disk space?

That would be my first guess, as such things are common in shared 
grid/cluster environments, particularly if your job runs in a temporary
and limited storage location (e.g. "/tmp/job-").

I would suspect that the exit-code you are seeing is the exit code
of the entire job (that is - of the shell script that is being qsub'd),
and not necessarily that of 'split' (then again, this might not be 
correct if you explicitly checked the exit code of 'split').


Given that your grid environment already has configuration issues
(the bash and "module" related errors), I would not be surprised if
the exit code is not reliable.

I would strongly encourage you to always look into the STDERR file
of the job to verify no other errors occurred.

Or, perhaps write shell scripts more defensively, like so:

  [...]
  zcat MH1_R1.fastq.gz | split -l 4000 - DHT_R1_ \
&& echo split MH1_R1 OK \
|| echo split MH1_R1 FAILED
  [...]

Then check the STDOUT for positive confirmation that each program succeeded.
Or perhaps:


  # define a shell function "die" to print an error and terminate
  die()
  {
base=$(basename "$0")
echo "$base: error: $*" >&2
exit 1
  }

  zcat MH1_R1.fastq.gz | split -l 4000 - DHT_R1_ \
|| die "split MH1_R1 failed"


And then run at least one job that will fail on purpose,
and ensure you see the error message in the STDERR log,
and you get a non-zero exit code (and then ensure you use 'die'
on every command).
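
One caveat with the checks above: a pipeline's exit status is normally that
of its last command, so a failing 'zcat' would go unnoticed as long as
'split' succeeds. A minimal sketch, assuming the job script runs under bash
(where 'set -o pipefail' is available; a plain /bin/sh may not support it),
with 'false' and 'cat' standing in for 'zcat' and 'split':

```shell
# Default behaviour: only the status of the last command ('cat') is reported.
plain_status=0
false | cat > /dev/null || plain_status=$?

# With pipefail (set inside a subshell so it does not leak into the rest of
# the script), a failure anywhere in the pipeline is reported.
strict_status=0
( set -o pipefail; false | cat > /dev/null ) || strict_status=$?

echo "plain=$plain_status strict=$strict_status"
```

Under bash this prints "plain=0 strict=1", i.e. only the second form notices
the failing first command.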


It is sometimes recommended to use "set -e" for "easy"
error handling in shell scripts- but I would recommend against it.
Many reasons detailed here: https://mywiki.wooledge.org/BashFAQ/105

It might be more frustrating to add such extra checks on every
program, but from my humble experience, grid environments bring
on so many more intermittent and transient problems that it is
definitely worth it.



STDERR:
The only thing in the stderr file is an odd duck of:

-sh: module: line 1: syntax error: unexpected end of file

-sh: error importing function definition for `BASH_FUNC_module'

Python 3.6.8 :: Anaconda, Inc.

/bin/sh: module: line 1: syntax error: unexpected end of file

/bin/sh: error importing function definition for `BASH_FUNC_module'

but this prints for every job I run with this particular flavor of 
conda/bash and doesn't seem to affect anything else (as far as I know)



These errors are specific to your grid/cluster environment,
and the best place to ask is the I.T or bioinformatics department in
your institute (whomever is in charge of the cluster).

Broadly speaking, "module" is a mechanism that eases the use of
various software packages. It is usually set up by your IT administrators.
A typical use-case is to have different versions of programs in
non-standard locations, e.g.
   samtools version 1.6 in /opt/it/programs/samtools-1.6
 and
   samtools version 1.9 in /opt/bioinfo/tools/new/samtools/

and then cluster users (e.g. you) just need to add:
   "module load samtools-1.9"
and have the command "samtools" just work without knowing
the gritty details of where the program is.

It seems that in your case, something relating to the "module"
setup is broken.

More information here: 
https://en.wikipedia.org/wiki/Environment_Modules_(software)



All jobs finished well below allotted memory and with exit status 0, 
even when split didn't make the right number of output files.

>
> Do you know any reason why the behavior would be inconsistent?

The "allotted memory" is a non-issue for this "split" command,
it will always use very little memory regardless of how big
the input files are.

As for "exit status 0" - I can't be sure, but I suspect the exit status
you see is the one of the entire job (i.e. the shell script),
and perhaps it does not represent the exit code of the "split" program.
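
A small sketch of how this can happen (the two one-line "jobs" below are
made up for illustration): a shell script exits with the status of its
last command, so an earlier failure is lost unless it is checked.

```shell
#!/bin/sh
# A job whose first command fails but whose last command succeeds:
# the job's exit status is that of the LAST command.
sh -c 'false; echo "job finished"'
unchecked=$?

# The same job, checking the failing command explicitly.
sh -c 'false || exit 1; echo "job finished"'
checked=$?

echo "unchecked job exit status: $unchecked"
echo "checked job exit status:   $checked"
```

The first job reports success even though one of its commands failed;
the second one stops at the failure and reports it.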

If you have the STDERR files of the jobs which failed, it's worth
checking them for any additional error messages.



Pairing check: unfortunately my server's version of bash doesn't support 
paste in this way, I've run into this issue before but I forget what the 
workaround is. I can't run this command interactively because my server 
times out (these files are > 3 billion lines each, so it takes a long 
time to zcat them)


Ah yes, the construct:

   program <(other program)

is a "bash" feature (called process substitution) that is not available
in simple shell scripts run by "sh" (interactive use vs non-interactive
and other things).

One work-around is to run (from inside your script):

  bash -c "paste <(zcat MH1_R1.fastq.gz) <(zcat MH1_R2.fastq.gz)" \
   | awk 'NR%4!=1 

bug#36130: split bug

2019-06-07 Thread Assaf Gordon
Hello,

On Fri, Jun 07, 2019 at 09:48:44PM -0400, Heather Wick wrote:
> Yes, sorry, I should have specified that I already checked that the
> original fastq files are indeed paired and sorted with the same number of
> lines and same starting/ending IDs, narrowing down the issue to a problem
> with split.

It could be a problem with "split", but we'll need to dig a bit deeper
to be able to pinpoint the exact issue.

Could you please try the following commands and post the results?

zcat MH1_R1.fastq.gz \
   | split --verbose -l 4000 - DHT_R1_ > DHT_R1.log ; echo DHT_R1 exit code: $?
zcat MH1_R2.fastq.gz \
   | split --verbose -l 4000 - DHT_R2_ > DHT_R2.log ; echo DHT_R2 exit code: $?
wc -l DHT_R1.log DHT_R2.log
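
To show what these commands report, here is a sketch on synthetic input
(10000 numbered lines in a throwaway directory; the "demo_" prefix is
made up). GNU split with --verbose prints one line per output file, so
counting the lines of the log gives the number of chunks:

```shell
#!/bin/sh
# Work in a throwaway directory with synthetic input.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One "creating file ..." line is written to the log per output file.
seq 1 10000 | split --verbose -l 4000 - demo_ > demo.log
status=$?

chunks=$(wc -l < demo.log)
echo "split exit code: $status"
echo "chunks reported in the log: $chunks"
ls demo_*
```

If the two real logs have a different number of lines, comparing them
shows which chunk is missing.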

Two more questions:
1. Can you post the result of "split --version"?
2. You mentioned "jobs" - if you are running these as submitted jobs on
a cluster (e.g. with "qsub"), can you double-check the STDERR log files
to ensure no errors were encountered?

If we still can't pinpoint the issue, the next steps would be to check
the DHT_R{1,2}.log files, and then try to compare the content of the
split files.

I assume the input files are indeed correctly paired, but just to check,
if you could try the following command, it should not print anything
to the screen (indicating all sequence IDs are paired):

paste <(zcat MH1_R1.fastq.gz) <(zcat MH1_R2.fastq.gz) \
   | awk 'NR%4!=1 { next } $1!=$3 { print "Error in line " NR ":" $1 " vs " $3 }'
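
To see what a mismatch would look like, here is a sketch on two tiny
synthetic FASTQ files (made-up reads; the second read ID differs on
purpose). It reads plain files, so it does not need the "<( )" syntax:

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Two tiny synthetic FASTQ files, two records (8 lines) each.
printf '%s\n' '@read1 1:N:0' ACGT + IIII '@read2 1:N:0' TTTT + IIII \
    > "$workdir/R1.fastq"
printf '%s\n' '@read1 2:N:0' ACGT + IIII '@readX 2:N:0' GGGG + IIII \
    > "$workdir/R2.fastq"

# Same awk check as above: only ID lines (every 4th line, starting at 1)
# are compared; $1 is the ID from the first file, $3 from the second.
report=$(paste "$workdir/R1.fastq" "$workdir/R2.fastq" \
   | awk 'NR%4!=1 { next } $1!=$3 { print "Error in line " NR ":" $1 " vs " $3 }')
echo "$report"
```

Only the second record is reported, since its IDs differ.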

regards,
 - assaf







Re: Error with clock?

2019-06-07 Thread Assaf Gordon
Hello,

On Fri, Jun 07, 2019 at 04:15:33PM +0000, h.lansel wrote:
> I am using Debian Stretch with XFCE.  I was customizing my clock, and
> encountered a few errors.  I don't know what package in relation to
> this software cause this and I don't even know how to do bug report,
> so I am writing to you, instead to bug report e-mail address.

As you suspected, this is not the right mailing list for such bugs.

The problem you describe might be related to Debian, or to the XFCE
project (or to another project, if the clock applet you are using
is not a built-in XFCE applet).

A good place to start is likely the Debian user mailing list:
   https://lists.debian.org/debian-user/
Or perhaps submitting a Debian bug (if you are sure it is a bug):
   https://www.debian.org/Bugs/Reporting

Alternatively, if Debian people indicate it is a problem in XFCE,
there is the XFCE bugzilla website:
   https://bugzilla.xfce.org/

This mailing list relates to GNU coreutils - a collection of
command-line programs which are not typically used directly in XFCE.

regards,
 - assaf





bug#36130: split bug

2019-06-07 Thread Assaf Gordon
Hello,

On Fri, Jun 07, 2019 at 02:23:15PM -0400, Heather Wick wrote:
> I am using split to split up some large, paired fastq files [...]:
>
>   zcat MH1_R1.fastq.gz | split - -l 4000 DHT_R1_
>   zcat MH1_R2.fastq.gz | split - -l 4000 DHT_R2_
>
> This creates 96 chunks for the R1 and 95 chunks for R2, even though the
> orignal fastq files have the same number of reads.
>
> Do you have any suggestions for how to proceed? Perhaps zcatting and piping
> the files is not the best way to call split?

To help diagnose the issue, please run the following commands
and tell us the results:

1. number of lines in each file:

   zcat MH1_R1.fastq.gz | wc -l
   zcat MH1_R2.fastq.gz | wc -l

2. The first two sequence IDs:

   zcat MH1_R1.fastq.gz | head -n8 | grep ^@
   zcat MH1_R2.fastq.gz | head -n8 | grep ^@

3. Last two sequence IDs:

   zcat MH1_R1.fastq.gz | tail -n8 | grep ^@
   zcat MH1_R2.fastq.gz | tail -n8 | grep ^@

These will just verify the FASTQ files are indeed paired with no
surprises. The files should have the same number of lines,
and matching sequence IDs in the first and last lines.
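
The three checks can also be combined into one comparison. A sketch on
two tiny synthetic files (made-up reads and file names; the "cut" step,
which keeps only the ID before the first space, is my addition):

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two tiny, correctly paired synthetic FASTQ files.
printf '%s\n' '@read1 1:N:0' ACGT + IIII '@read2 1:N:0' TTTT + IIII \
    | gzip > R1.fastq.gz
printf '%s\n' '@read1 2:N:0' ACGT + IIII '@read2 2:N:0' GGGG + IIII \
    | gzip > R2.fastq.gz

# 1. Number of lines in each file.
n1=$(zcat R1.fastq.gz | wc -l)
n2=$(zcat R2.fastq.gz | wc -l)

# 2. First sequence ID of each file.
first1=$(zcat R1.fastq.gz | head -n1 | cut -d' ' -f1)
first2=$(zcat R2.fastq.gz | head -n1 | cut -d' ' -f1)

# 3. Last sequence ID of each file (a FASTQ record is 4 lines long).
last1=$(zcat R1.fastq.gz | tail -n4 | head -n1 | cut -d' ' -f1)
last2=$(zcat R2.fastq.gz | tail -n4 | head -n1 | cut -d' ' -f1)

if [ "$n1" -eq "$n2" ] && [ "$first1" = "$first2" ] \
   && [ "$last1" = "$last2" ]; then
    verdict="files look paired"
else
    verdict="MISMATCH"
fi
echo "$verdict ($n1 vs $n2 lines)"
```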

regards,
 - assaf






Re: question about parallelism in cp command

2019-06-06 Thread Assaf Gordon
> -----Original Message-----
> From: Olga Kornievskaia [mailto:a...@umich.edu]
>
> Is there something philosophically incorrect in making a “cp”
> multi-threaded and allow for parallel copies when “cp -r” is done? If
> it’s something that’s possible, are there any plans in making a
> multi-threaded cp?

On Thu, Jun 06, 2019 at 02:17:40PM -0400, Olga Kornievskaia wrote:
> The use case I'm consider are network file systems. So perhaps a
> default can be a single threaded system for the local filesystems but
> add an option to cp for the -r case that would enable network file
> system to copy files in parallel.

In an interesting coincidence, see recent post by Paul Kolano here:
https://lists.gnu.org/archive/html/coreutils/2019-06/msg00011.html

(Note that his suggestions have not been reviewed yet, so this is
neither endorsement nor criticism of his code.)

regards,
 - assaf



Re: patches for multi-threaded cp and md5sum (along with other features)

2019-06-06 Thread Assaf Gordon
Hello Paul,

On Mon, Jun 03, 2019 at 09:29:20PM +0000, Paul Kolano (ARC-TN)[InuTeq, LLC] wrote:
> Many years ago, I developed a set of patches to add a number of
> features to cp and md5sum [...]
>   https://pkolano.github.io/projects/mutil.html

Thanks for sharing, looks very impressive.

Because the changes are massive, before we can start looking into their
details and merits we'll need copyright assignment from the copyright
holder of the code (you or NASA).

For details please see here:
  https://www.gnu.org/licenses/why-assign.en.html

To start the process, please fill and submit the following form:
  https://git.sv.gnu.org/cgit/gnulib.git/tree/doc/Copyright/request-assign.future

---

Additionally,
A cursory look at the patches [1] reveals several added terms in
accordance with GPLv3 section 7 (e.g. indemnifying NASA and the U.S.
government).
This is of course absolutely fine and valid for a GPL project,
but I'm not sure if the FSF will agree to add additional terms to GNU
coreutils (I'm not saying they won't, I simply don't know).
Perhaps other maintainers can chime in, and if not, it is probably wise
to ask licens...@gnu.org before we can consider these patches for
inclusion.

[1]
https://github.com/pkolano/mutil/blob/master/patch/coreutils-8.22.patch

---

regards,
 - assaf



Re: How to calculate date relative to another date?

2019-06-01 Thread Assaf Gordon
Hello,

On Wed, May 22, 2019 at 10:41:52AM -0400, Michael Stone wrote:
> In general my advice is to just avoid the date parsing entirely, it will
> never, ever do what you predict.

I'm sorry to hear that is your experience with date(1) parsing.

My advice is different: use "date --debug" first to troubleshoot what
is being parsed, then search the mailing list archives for common
solutions, and lastly, write to coreutils@gnu.org with questions.

> If you find something that happens to work,
> just copy and paste it and never change it. It would be nice if there were a
> new, simple and predictable grammer option in date(1) (abandon the natural
> language guessing) but nobody has ever wanted to do the work. :)

The grammar is predictable (though perhaps not trivial) for the simple
reason that it is based on a fixed set of rules defined in a GNU Bison
".y" file: https://git.sv.gnu.org/cgit/gnulib.git/tree/lib/parse-datetime.y .

There are no "natural language guessing" algorithms.

Instead, and perhaps that's the confusing part, there are many attempts
by the parser to match date strings to known meanings.

For example,
NNNN/NN/NN is parsed as YYYY/MM/DD.
NN/NN/NNNN is parsed as MM/DD/YYYY (the North American way).
NNNNNN is parsed as YYMMDD (with YY being 19YY or 20YY with 69 as the
cutoff).
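
These rules can be checked directly with GNU date. A small sketch (the
sample dates are arbitrary; the time zone and locale are fixed only to
keep the results independent of local settings):

```shell
#!/bin/sh
TZ=UTC0; LC_ALL=C
export TZ LC_ALL

d1=$(date -d '2019/06/01' +%F)   # year first:  YYYY/MM/DD
d2=$(date -d '06/01/2019' +%F)   # month first: MM/DD/YYYY
d3=$(date -d '190601'     +%F)   # six digits:  YYMMDD
y69=$(date -d '690101' +%Y)      # two-digit year 69 is taken as 1969
y68=$(date -d '680101' +%Y)      # two-digit year 68 is taken as 2068

echo "$d1 $d2 $d3 $y69 $y68"
```

All three spellings give the same day, and the last two show the
cutoff for two-digit years.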

Then similar patterns are matched for time, timezone, and date/time
adjustments.

The different formats and patterns are explained here:
https://www.gnu.org/software/coreutils/manual/html_node/Date-input-formats.html#Date-input-formats

> You might try "2018-05-01 59 months ago", but I'd suggest using a python
> module or somesuch with a more regular grammar if you want something
> maintainable in the long term.

I would argue that "long term" and "maintainable" are exactly what GNU
date(1) parsing is. You'd be hard-pressed to find programs with
longer-term support than GNU date(1), including python modules.

The confusing and possibly frustrating part happens when trying to mix
different parsing "parts" like date and time and timezone and relative
time calculations.

The "--debug" option should be the first tool to use.
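
For example, a minimal sketch using the date string suggested above
(the temporary log file is only there to separate the two outputs; the
exact wording of the debug messages depends on the coreutils version):

```shell
#!/bin/sh
TZ=UTC0; LC_ALL=C
export TZ LC_ALL

# --debug prints the parsing steps to STDERR;
# the resulting date still goes to STDOUT.
dbg=$(mktemp)
result=$(date --debug -d '2018-05-01 59 months ago' +%F 2> "$dbg")

echo "result: $result"
cat "$dbg"
```

The debug lines show how each part of the string was understood (the
date, then the relative adjustment), which is usually enough to spot
where the parser and the user disagree.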

The most common issues are:

Crossing daylight-saving-time (getting unexpected "tomorrow" results):
https://lists.gnu.org/archive/html/bug-coreutils/2019-04/msg00003.html
https://lists.gnu.org/archive/html/bug-coreutils/2016-04/msg00046.html
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=30795

Mixing time and time-zones:
https://lists.gnu.org/archive/html/bug-coreutils/2018-10/msg00126.html

Months-related adjustments:
https://lists.gnu.org/archive/html/bug-coreutils/2018-10/msg00357.html

General adjustments, and order of operations:
https://lists.gnu.org/archive/html/bug-coreutils/2018-02/msg00005.html

Leap years and such:
https://lists.gnu.org/archive/html/bug-coreutils/2017-03/msg00047.html

Inner workings of date adjustments:
https://lists.gnu.org/archive/html/bug-coreutils/2017-03/msg00044.html

Hope this helps,
 -assaf



