Re: thoughts on NO_COLOR

2022-02-28 Thread Kaz Kylheku (Coreutils)

On 2022-02-27 12:37, Pádraig Brady wrote:

I just noticed some de facto treatment of the NO_COLOR env var.
https://no-color.org/


These people are not system implementors; they should not be proposing
variables in a POSIX-reserved namespace.

The website provides no contact links whatsoever; they have walled
themselves in Github, and take pull requests as the only means of
communication.

This is not the way to do things if you want to promote a standard,
or to engage with the free software world.

I'm instantly opposed to this NO_COLOR on the above grounds, and think
it's an excellent idea to show opposition by making programs call
abort() when they find NO_COLOR to be set.

However, I do think it's a great idea for users who don't want
color to have a single place to turn it off.


I was considering having ls --color=auto honor this, but then thought
it is not actually needed in ls, since we give fine grained
control over the colors / styles used.

For example one might very well always want at least some distinguishing
of files and directories, with bold / bright etc.,
which can be achieved now with LS_COLORS.

Or looking at it another way, ls is ubiquitous enough
that it's probably already color configured as the user desires,


ls is almost certainly color configured the way the user's distro
desires, at least initially.

Clearly, the variable is aimed at people who don't find
things configured as they want by default, and who don't want to
go into individual programs or distro scripts to do that, but only
flip a single master switch.
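
To illustrate what the master switch amounts to, a minimal sketch in
shell (MYTOOL_COLOR is a made-up per-program knob; as I read the page,
mere presence of NO_COLOR in the environment means "off"):

   if [ -n "${NO_COLOR+set}" ]; then
       color=never
   else
       color=${MYTOOL_COLOR:-auto}
   fi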


and having ls honor the less fine-grained NO_COLOR flag
would result in less flexibility.


More knobs to tweak can never result in less flexibility.

Were you thinking of "perplexity"? ;)

Speaking of which, I can see someone pulling their hair out trying
to get color working, not realizing that something set a NO_COLOR
environment variable somewhere.





Re: how to speed up sort for partially sorted input?

2021-08-12 Thread Kaz Kylheku (Coreutils)

On 2021-08-11 11:58, Peng Yu wrote:

On Wed, Aug 11, 2021 at 1:43 PM Kaz Kylheku (Coreutils)
<962-396-1...@kylheku.com> wrote:


On 2021-08-11 05:03, Peng Yu wrote:
> On Wed, Aug 11, 2021 at 5:29 AM Carl Edquist wrote:
>> (With just a bit more work, you can do all your sorting in a single awk
>> process too (without piping out to sort), but i think you'll still be
>> disappointed with the performance compared to a single sort command.)
>
> Yes, this involves many calls of the coreutils' sort, which is not


No, not this last remark, which is about "in a single awk process".


I know there is one awk process. I don't understand why you mentioned it.


(That's why.)


> efficient. Would it make sense to add an option in sort so that sort
> can sort a partially sorted input in one shot.

IF you're willing to use GNU Coreutils instead of Unix, you probably have


I don't think using awk is efficient. I have written a number of awk
programs for simple transformations of the input and tested them; in
general, they are slower than the equivalent Python code, let alone C code.

You talk about doing most of the work in awk below. I don't think
that makes sense. Having coreutils' sort be able to do a partial sort
is a more reasonable solution.


That solution doesn't exist today, whereas the Gawk program should
run even on ten-year-old installations.

For the solution to be useful, it only has to beat the actual sort
which you have available today, not some imagined version of sort
that isn't yet available.

I'm assuming that you're posting here because you have some real
problem to solve, not just to postulate chrome plating for Coreutils,
and so that a working program today would be of use to you.

A vast amount of useful computing is being done with tools and
approaches that are not thoroughly optimized.

Sometimes those approaches usefully prototype a solution which
is later optimized or replaced; in the meantime, that solution
serves a useful purpose.




Re: how to speed up sort for partially sorted input?

2021-08-11 Thread Kaz Kylheku (Coreutils)

On 2021-08-11 05:03, Peng Yu wrote:
On Wed, Aug 11, 2021 at 5:29 AM Carl Edquist wrote:
(With just a bit more work, you can do all your sorting in a single awk
process too (without piping out to sort), but i think you'll still be
disappointed with the performance compared to a single sort command.)


Yes, this involves many calls of the coreutils' sort, which is not


No, not this last remark, which is about "in a single awk process".


efficient. Would it make sense to add an option in sort so that sort
can sort a partially sorted input in one shot.


IF you're willing to use GNU Coreutils instead of Unix, you probably have
GNU Awk also. GNU Awk has a sorting function using which a solution
could be cobbled together. Maybe something like:


function dump_delete_data()
{
    n = asorti(data, idx);
    for (i = 1; i <= n; i++)
        print data[idx[i]];
    delete data
    serial = 0
}

BEGIN        { serial = 0 }
$1 != prev_1 { dump_delete_data() }
NF >= 2      { prev_1 = $1
               data[$2 "." serial++] = $0
               next }
1            { dump_delete_data()
               print }
END          { dump_delete_data() }


The asorti function has some features behind it to sort in various ways;
you have to look into that. It involves manipulating a
PROCINFO["sorted_in"] value.

It's possible to use a custom comparison function.

For more info, see GNU Awk documentation, the Gawk mailing list or
the comp.lang.awk newsgroup.

The purpose of the serial variable in my above code is so that we get
two entries in data[] if, in a given group, there are identical $2
values.


For instance if $2 is "foo", then the key we use is actually "foo.3" if
the current value of serial is 3. The sorting is then done on these
suffixed keys, which works okay for lexicographic sorting.

It is not a stable sort, though! Because foo.123 will be sorted before
foo.23, even though the 123 serial value comes later. If we padded the
integer with enough leading zeros for the largest possible group, it
would then be stable: foo.00023 would come before foo.00123:

   data[sprintf("%s.%08X", $2, serial++)] = $0

kind of thing.

If you don't care about reproducing duplicates, you can remove this
logic entirely.

How the overall program works is that data[] is an array indexed on the
second column values (plus serial suffixes). The value of each index
value is the entire record, $0.

asorti sorts the $2 indices, throwing away the $0 values, which
is why we direct it into a secondary array called idx, preserving
the data array. The idx array ends up indexed on integer values 1 to N,
where N is the chunk size. If we iterate over these values, idx[i]
gives us the $2 column values (with serial suffix) in sorted order.
We can then use that as the key into data[] to get the corresponding
records in sorted order.
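
As a usage sketch (chunksort.awk is just my name here for a file
holding the above program; the input is assumed to be already sorted
on column 1):

   gawk -f chunksort.awk < input.txt > output.txt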

Cheers ...




Re: how to speed up sort for partially sorted input?

2021-08-10 Thread Kaz Kylheku (Coreutils)

On 2021-08-10 22:06, Kaz Kylheku (Coreutils) wrote:

On 2021-08-07 17:46, Peng Yu wrote:

Hi,

Suppose that I want to sort an input by column 1 and column 2 (column
1 is of a higher priority than column 2). The input is already sorted
by column1.

Is there a way to speed up the sort (compared with not knowing column
1 is already sorted)? Thanks.


Since you know that column 1 is sorted, it means that a sequential scan
of the data will reveal chunks that have the same column 1 value.

You just have to read and separate these chunks, and sort each one
individually by column 2.

GNU Awk has the wherewithal for this sort of thing; it has some
facilities for sorting associative arrays.


TXR Lisp: structure + function + awk macro:

(defstruct rec ()
  f1
  rec)

;; sort list of records by slot f1, and then
;; dump their rec slots in order via put-line.

(defun dump (data)
  (mapdo [chain .rec put-line] (nsort data : .f1)))

(awk
  ;; two local variables
  (:let f0-prev data)

  ;; if field zero is not same as f0-prev, sort and dump data,
  ;; then set data to nil again

  ((nequal [f 0] f0-prev) (dump data)
                          (set data nil))

  ;; if we have a second field, remember field zero in f0-prev,
  ;; capture the record in a structure,
  ;; and push it on the list.  Go to the next record.

  ([f 1] (set f0-prev [f 0])
 (push (new rec f1 [f 1] rec rec) data)
 (next))

  ;; we don't have a second field: just sort and
  ;; dump the accumulated data, and also print this record.

  (t (dump data)
 (prn))

  ;; end of data: sort and dump accumulated data.
  (:end (dump data)))



Re: how to speed up sort for partially sorted input?

2021-08-10 Thread Kaz Kylheku (Coreutils)

On 2021-08-07 17:46, Peng Yu wrote:

Hi,

Suppose that I want to sort an input by column 1 and column 2 (column
1 is of a higher priority than column 2). The input is already sorted
by column1.

Is there a way to speed up the sort (compared with not knowing column
1 is already sorted)? Thanks.


Since you know that column 1 is sorted, it means that a sequential scan
of the data will reveal chunks that have the same column 1 value.

You just have to read and separate these chunks, and sort each one
individually by column 2.

GNU Awk has the wherewithal for this sort of thing; it has some
facilities for sorting associative arrays.

You can scan records and aggregate them while column 1 is the same,
then do some sorting and output (also at the end of the file).

Good luck!




Re: Suggest on "ln"

2021-07-19 Thread Kaz Kylheku (Coreutils)

On 2021-07-19 00:50, Patrick Reader wrote:

On 19/07/2021 08:48, Kamil Dudka wrote:

On Monday, July 19, 2021 2:29:18 AM CEST James Lu wrote:
"ln" should write a warning to stderr if the source file doesn't 
exist.
ln writes an error message to stderr if the source file does not 
exist:


$ mkdir new-dir
$ cd new-dir
$ ln does-not-exist target
ln: failed to access 'does-not-exist': No such file or directory


I'm guessing they meant `ln -s`.


Symbolic links with nonexistent targets are legitimate and useful.

They can be used to stash information that isn't intended to be
a pointer to an object in the file system at all:

   ln -sf "" hash

A symlink is essentially a tiny text file where you can store
info, subject to some easy-to-meet restrictions.

Dangling links can be prepared in a file system structure that will
be installed somewhere, where the links will resolve:

  ln -sf /etc/alternatives/netcat $(DESTDIR)/bin/netcat

An option to emit a warning could be mildly useful, but it's
nothing you can't check yourself *after* the symlink is made:

  ln -sf $TARGET $LINK   # quoting elided for brevity

  [ -e $LINK ] || printf "warning: link target %s doesn't exist\n" $TARGET

Doing the check before the link is made is more involved because
a relative link target is resolved relative to the link's location.

Plus if it is buggy, then it won't match what the operating system
says; the ultimate arbiter of whether the link is dangling is
to create it and actually test it.




Re: [PATCH] copy: disallow copy_file_range() on Linux kernels before 5.3

2021-05-18 Thread Kaz Kylheku (Coreutils)

On 2021-05-12 16:09, Pádraig Brady wrote:

copy_file_range() before Linux kernel release 5.3 had many issues,


Remark: although there is nothing wrong with the patch, and it is
necessary, this seems like an issue for the C library to handle,
as well.

- The GNU C library provides the function copy_file_range. The
  fact that this is a linux kernel feature is abstracted by the library.

- The GNU C library knows what version of the kernel it is running on,
  and provides workarounds in relation to that (or possibly refuses
  to run at all).

So, arguably, the responsibility of somehow working around this lies
with glibc; glibc is the purveyor of the API.

Unfortunately, I don't see any discussions about the issue in
libc-alpha mailing list.

The situation is not acceptable that a GNU program is working around
broken GNU library functions, with no action being taken in the
GNU library (even if that program will need to carry those workarounds
anyway).

The most recent commits which mention copy_file_range in the
commit message are:

2020-03-03  Florian Weimer  Linux: copy_file_range syscall number is always available
2019-07-08  DJ Delorie      NEWS: clarify copy_file_range
2019-06-28  Florian Weimer  io: Remove copy_file_range emulation [BZ #24744]


(I am obviously assuming that any commit related to this issue
will have "copy_file_range" in the commit message.)

The comment for that third one is interesting:

  The kernel is evolving this interface (e.g., removal of the
  restriction on cross-device copies), and keeping up with that
  is difficult.  Applications which need the function should
  run kernels which support the system call instead of relying on
  the imperfect glibc emulation.

Glibc should not be providing a function like this at all
until it stabilizes. Programs wanting to use a bleeding edge
kernel call should use some syscall macro to generate it
themselves.

It sounds as if, now that it may have stabilized, glibc should
be offering it under a different name/alias.

Programs that need a reliable copy_file_range can then just
refer to that new name which indicates reliable semantics. They
detect that name in their configure scripts, and build and
link against that name. Those programs will then refuse to
run against a glibc which doesn't export that name.
Problem solved.



Re: [PATCH] ls: add --sort=width (-W) option to sort by filename width

2021-04-16 Thread Kaz Kylheku (Coreutils)

On 2021-04-09 15:51, Pádraig Brady wrote:

On 09/04/2021 13:02, Carl Edquist wrote:

Dear Coreutils Maintainers,

I'd like to introduce my favorite 'ls' option, '-W', which I have been
enjoying using regularly over the last few years.

The concept is just to sort filenames by their printed widths.


(If this sounds odd, I invite you to hear it out, try it and see for
yourself!)



I am including a patch with my implementation and accompanying tests - as
well as some sample output.  And I'll happily field any requests for
improvements.


I quite like this. It seems useful.
Also doing outside of ls is quite awkward,
especially considering multi column output.


Ah, but not so!

What is awkward is doing the sorting outside of ls, using only
the shell and utilities.

The multi column output can be done by feeding the sorted list of
files to ls, with the -df options (don't list directories, don't sort).

Demo:

ls -f | gawk -f sizesort.awk
.  buf.h   time.c   arith.h   alloca.h  y.outputgenman.txr
.. txr.h   hash.h   txr.vim   struct.h  lex.yy.cgencadr.txr
ID lib.c   tree.c   combi.c   socket.h  parser.hMETALICENSE
tstbuf.c   glob.c   arith.c   parser.y  filter.creconfigure
wintxr.c   cadr.h   chksums   parser.l  termios.h   genvmop.txr
txrffi.h   eval.h   sysif.c   inst.nsi  termios.c   LICENSE-CYG
optftw.h   glob.h   regex.h   chksum.c  linenoise   config.make
mpijmp.S   args.c   debug.h   signal.c  configure   sizesort.awk
tags   ffi.c   hash.c   y.tab.h   syslog.h  protsym.c   .gdb_history
pack   lib.h   utf8.c   tags.tl   stream.c  lisplib.h   checkman.txr
gc.c   rand.c  time.h   regex.c   unwind.c  strudel.c   genprotsym.txr
gc.h   args.h  combi.h  INSTALL   itypes.c  strudel.h   y.tab.c.shipped
vm.c   rand.h  debug.c  match.c   syslog.c  gs_YEC3Hr   y.tab.h.shipped
vm.h   utf8.h  HACKING  unwind.h  chksum.h  optand.tl   txr-manpage.pdf
.git   cadr.c  parser.  filter.h  signal.h  gs_P4Z02S   HACKING-toc.txr
txr.1  tl.vim  sysif.h  itypes.h  RELNOTES  gs_8aK1VJ   lex.yy.c.shipped
share  eval.c  match.h  struct.c  parser.c  gs_G7H2OA   ChangeLog-2009-2015
tests  tree.h  LICENSE  config.h  socket.c  lisplib.c
ftw.c  vmop.h  y.tab.c  Makefile  stream.h  genvim.txr

Source code of sizesort.awk (which uses GNU Awk extensions):

#!/usr/bin/gawk -f

function compare(ia, a, ib, b)
{
    return length(a) - length(b)
}

{ dir[NR] = $0 }

END {
    n = asort(dir, sdir, "compare")
    for (i = 1; i <= n; i++)
        print sdir[i] | "xargs ls -fd"
}

But this doesn't handle arbitrary file names. However, Awk can process
null terminated/separated records, as put out by find -print0:

Hold my beer:

$ find . -print0 | awk -v RS='\0' '{print $1}' | head
.
./rand.c
./args.h
./termios.h
./combi.h
./rand.h
./gencadr.txr
./unwind.h
./txr.1
./termios.c

Proof of concept with sorting:

$ find . -maxdepth 1 -print0 | gawk -f sizesort0.awk
.        ./txr.c   ./args.c   ./regex.c   ./chksum.h   ./gs_G7H2OA
./ID ./ffi.h   ./hash.c   ./INSTALL   ./signal.h   ./lisplib.c
./tst./ftw.h   ./utf8.c   ./match.c   ./RELNOTES   ./genvim.txr
./win./jmp.S   ./time.h   ./unwind.h  ./parser.c   ./genman.txr
./txr./ffi.c   ./combi.h  ./filter.h  ./socket.c   ./gencadr.txr
./opt./lib.h   ./debug.c  ./itypes.h  ./stream.h   ./METALICENSE
./mpi./rand.c  ./HACKING  ./struct.c  ./y.output   ./reconfigure
./tags   ./args.h  ./parser.  ./config.h  ./lex.yy.c   ./genvmop.txr
./pack   ./rand.h  ./sysif.h  ./Makefile  ./parser.h   ./LICENSE-CYG
./gc.c   ./utf8.h  ./match.h  ./alloca.h  ./filter.c   ./config.make
./gc.h   ./cadr.c  ./LICENSE  ./struct.h  ./termios.h  ./sizesort.awk
./vm.c   ./tl.vim  ./y.tab.c  ./socket.h  ./termios.c  ./.gdb_history
./vm.h   ./eval.c  ./arith.h  ./parser.y  ./linenoise  ./checkman.txr
./.git   ./tree.h  ./txr.vim  ./parser.l  ./configure  ./sizesort0.awk
./txr.1  ./vmop.h  ./combi.c  ./inst.nsi  ./protsym.c  ./genprotsym.txr
./share  ./time.c  ./arith.c  ./chksum.c  ./lisplib.h  ./y.tab.c.shipped
./tests  ./hash.h  ./chksums  ./signal.c  ./strudel.c  ./y.tab.h.shipped
./ftw.c  ./tree.c  ./sysif.c  ./syslog.h  ./strudel.h  ./txr-manpage.pdf
./buf.h  ./glob.c  ./regex.h  ./stream.c  ./gs_YEC3Hr  ./HACKING-toc.txr
./txr.h  ./cadr.h  ./debug.h  ./unwind.c  ./optand.tl  ./lex.yy.c.shipped
./lib.c  ./eval.h  ./y.tab.h  ./itypes.c  ./gs_P4Z02S  ./ChangeLog-2009-2015
./buf.c  ./glob.h  ./tags.tl  ./syslog.c  ./gs_8aK1VJ

sizesort0.awk:

function compare(ia, a, ib, b)
{
    return length(a) - length(b)
}

BEGIN {
    RS = "\0"
}

{ dir[NR] = $0 }

END {
    n = asort(dir, sdir, "compare")
    for (i = 1; i <= n; i++)
        printf "%s\0", sdir[i] | "xargs -0 ls -fd"
}

What we could use here is a "ls -0" option that is like "ls -1" but with
null termination. And likewise some option to have ls read file names
from standard input. Line-wise by default, or null-terminated if -0
is specified.

So easy in a language with more well-rounded functionality:

1> (run "ls" (cons "-fd" [sort (get-line

Re: version-sort ugliness or bugs

2021-04-16 Thread Kaz Kylheku (Coreutils)

On 2021-04-15 18:44, Erik Auerswald wrote:

Hi,

On Thu, Apr 15, 2021 at 11:47:34PM +0200, Vincent Lefevre wrote:

I'm currently using version-sort in order to get integers sorted
in strings (due to the lack of simple numeric sort like in zsh),
but I've noticed some ugliness. This may be bugs, but I'm not sure

[ ... ]
I think all of your problems ("ugliness") are caused by the concept of
"file extensions" in GNU Coreutils version sort.

https://www.gnu.org/software/coreutils/manual/coreutils.html#Special-handling-of-file-extensions


That strikes me as a very poor set of requirements. The treatment
of suffixes is extremely hacky, and unnecessary.

Here is an algorithm + implementation I hacked up in 15 minutes.  Here is
the informal spec. Note that it makes no mention of special case hacks
for suffixes, yet suffixes end up treated reasonably:

1. A string is parsed into tokens. There are three kinds of tokens:
   - DOT: (".")
   - INT: decimal string (e.g. "123")
   - TXT: sequence of other characters

2. INT tokens are converted to integer values.

3. The token sequence is parsed in order to shore up
   INT DOT INT { DOT INT }* ... sequences into (INT INT ...) lists.

4. Any other INT token not placed into a list is turned into
   a list of one integer: (INT).

Then, the resulting sequence is compared as follows:

- TXT-TXT comparisons are ordinary lexicographic

- LIST-LIST comparisons are lexicographic on the list of integers

- Otherwise, the sorting order is DOT < TXT < LIST

Sample implementation in TXR Lisp.  Note: to achieve DOT < TXT,
we replace "." tokens with the character object #\.
The TXR Lisp less function then takes care of it:

 (less #\a "a" '(1 2 3)) -> t


Run:

$ txr versort.tl
abc.txt
abc-1d.2c.tar.gz
abc-1.2.tar.gz
abc-1.2c.tar.gz
abc-1.2.3.tar.gz
abc-1.2.3-3.14.tar.gz
abc-1.2.3-4.5.tar.gz
abc-1.2.3-9.tar.gz
abc-1.2.3-9.tgz
abc-1.2.3-9-sig.bin
abc-1.2.3.3.14.tar.gz
abc-2-tar.gz
abc-11-tar.gz
foo.txt
zzz-3.0
zzz-4.0
zzz-xyz-4.5
zzz-xyz-9.15.3

Code in versort.tl:

abc-1d.2c.tar.gz is before abc-1.2 because d is not part of the version
number. This is a case of version 1 coming before 1.2.

(Don't have trailing junk in your version numbers, except possibly at
the very end; keep them numeric!)


(defun ver-tok (str)
  (tok #/\.|\d+|[^\d.]+/ str))

(defun ver-parse (str)
  (let ((all-toks (ver-tok str)))
(labels ((convert (toks)
   (mapcar [iffi (fr^ #/[0-9]/) toint] toks))
 (parse (:match)
   (((@(integerp @a) "." @(integerp @b) . @rest))
 (parse (cons (list a b) rest)))
   (((@(integerp @a) . @rest))
 (parse (cons (list a) rest)))
   (((@(listp @a) "." @(integerp @b) . @rest))
 (parse (cons (append (flatten a) (list b)) rest)))
   ((("." . @rest)) (cons #\. (parse rest)))
   (((@a . @rest)) (cons a (parse rest)))
   ((@else) else)))
  (parse (convert all-toks))))

(defun ver-recombine (vsyntax)
  (cat-str (mapcar [iffi consp
                         [chain (op mapcar tostring)
                                (ap join-with ".")]]
                   vsyntax)))

(defun ver-sort (strings)
  [mapcar ver-recombine (sort [mapcar ver-parse strings])])

(let ((data '("abc-1.2.3.tar.gz"
              "zzz-4.0"
              "abc-11-tar.gz"
              "abc-2-tar.gz"
              "abc-1d.2c.tar.gz"
              "abc-1.2c.tar.gz"
              "abc-1.2.3-9-sig.bin"
              "abc-1.2.tar.gz"
              "abc-1.2.3-9.tar.gz"
              "abc-1.2.3-3.14.tar.gz"
              "abc-1.2.3.3.14.tar.gz"
              "abc-1.2.3-9.tgz"
              "zzz-3.0"
              "foo.txt"
              "abc.txt"
              "abc-1.2.3-4.5.tar.gz"
              "zzz-xyz-9.15.3"
              "zzz-xyz-4.5")))
  (tprint (ver-sort data)))




Re: [PATCH] cksum: Use pclmul hardware instruction for CRC32 calculation

2021-03-14 Thread Kaz Kylheku (Coreutils)

On 2021-03-14 12:55, Jeffrey Walton wrote:

The underlying problem is GCC, Clang and friends conflate the user's
ISA with ISA the compiler uses. They are not the same - they are
distinct. Unfortunately, GCC and Clang never addressed the underlying
problem.


Sorry, what does that mean?

GCC works fine as a cross-compiler. E.g. built to run on the x86_64 ISA, 
but putting out Aarch64 code.


The "Submodel" options of GCC are determined by the configuration: how 
that GCC was built.


On x86 (and maybe others), there is a "native" argument for -march and
-mtune, as in -march=native.


If GCC is configured that way, it will generate code according to the 
processor of the machine it is running on. (Unless, I'm guessing, it's 
built as a cross-compiler, so the build machine's architecture is 
irrelevant.)





Re: [PATCH] cksum: Use pclmul hardware instruction for CRC32 calculation

2021-03-12 Thread Kaz Kylheku (Coreutils)
On 2021-03-12 07:33, Kristoffer Brånemyr via GNU coreutils General 
Discussion wrote:

Hi,
I was just wondering if you are planning to merge the change, or if
you decided against it? :) I wanted to use the cpuid.h autoconf
detection for another patch I'm working on.


Regarding the comment "Since the time the process spends
waiting on syscalls (fread) is still the same, actual real
time speedup is only 3x. It would be an interesting exercise
to try to use async IO, so you could checksum one block while
reading the next. Maybe I will try that one day."

You never know, but probably not. If the 3x performance was
achieved with a hot cache, then async I/O probably isn't
going to do anything, since everything is in RAM already.
When the cache is pre-loaded, the I/O syscalls are pure
CPU overhead, since nothing is waiting on any real I/O.

I would try these improvements, in order:

- Don't use stdio fread, which is an extra layer of calls
  and buffering over read. Use read, and play with different
  buffer sizes.

- Use mmap to map the file to memory, and then crc32 that buffer.

In the non-hot-cache case where async I/O might help, you can
likewise get a potential improvement with mmap by using madvise
with MADV_SEQUENTIAL to give it a hint that you're performing
sequential access (which benefits from reading ahead).







Re: Add dry-run option to mv

2021-03-10 Thread Kaz Kylheku (Coreutils)

On 2021-03-10 13:59, L A Walsh wrote:

On 2021/03/07 03:20, Emilio Garcia wrote:

   Hi all,

   I checked out the coreutils repo on Github and I would like to ask you
   to add a dry-run option in the mv command.

   When I've needed such functionality, I insert an 'echo'
before the 'mv' command, so in a script:

cmd=eval
if ((dry_run)); then
    cmd=echo
fi


Me too; but that doesn't validate the arguments like Emilio wants.

E.g.

  mv --dry-run existing-file nonexistent-dir/  # error
  mv --dry-run nonexistent-object somewhere # error
  mv --dry-run object /another/filesystem # diagnostic
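
A rough sketch of the kind of checking that implies (incomplete; a real
implementation would also consider permissions, cross-device moves and
more):

   mv_dry_run()
   {
       src=$1 dst=$2
       [ -e "$src" ] || { echo "mv: cannot stat '$src'" >&2; return 1; }
       case $dst in
       */) [ -d "$dst" ] ||
           { echo "mv: target '$dst' is not a directory" >&2; return 1; } ;;
       esac
   }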





Re: Sorting SNMP numeric OID's?

2021-02-22 Thread Kaz Kylheku (Coreutils)

On 2021-02-22 07:31, Ed Fair via GNU coreutils General Discussion wrote:

Has it ever been discussed to add an option to the sort utility for
sorting numeric SNMP object identifiers by sub-identifier?


Probably not, but what has been discussed is sorting version numbers
like 1.2.3.

How are SNMP OIDs different from version numbers for the purposes of
sorting?


GNU Coreutils sort supports -V/--version sort. Do you know about it, and
have you tried it?

What requirements in relation to SNMP OIDs are not met by this feature?
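
For instance, if sub-identifier-wise numeric ordering is what you're
after:

   $ printf '%s\n' 1.3.6.1.10 1.3.6.1.2.1 1.3.6.1.4.1.2021 | sort -V
   1.3.6.1.2.1
   1.3.6.1.4.1.2021
   1.3.6.1.10

A plain lexicographic sort would place 1.3.6.1.10 before 1.3.6.1.2.1.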




Re: chmod: man page clarification for letter X

2020-10-08 Thread Kaz Kylheku (Coreutils)

On 2020-10-08 10:28, Tomás Fernandes wrote:

Hello,

I've recently come across chmod's man page as someone who is not very
experienced (1st year CS undergrad), and found the definition of the
letter X in the man page a bit unclear, more specifically this part (in
bold):


On the topic of chmod documentation, it could use a clarification in the
following matter.

chmod supports a = operator for copying permissions. For instance
u=g means "make the u permissions be like g".

chmod also supports multiple operations, like

  u=g,g=o

The behavior of GNU Coreutils chmod is that the = operator samples the
*new* value of the permissions (everything to the left has already
taken place). This is true even without the comma separation, when =
is combined in one clause with other operators, as in:

  o+x=o

Here, the o+x will apply the x permission to o. Then this effect is
settled, and the =o assignment therefore has no effect; it's the same as:

  o+x,o=o

Or something like that; I've not looked at this stuff in a while,
but it was one of the issues I ran into when making a chmod 
implementation.


It would be good if the documentation spelled it out that = references
the new permissions which result from everything to the left of =
having been processed.
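
A quick check of the behavior described above (assuming my recollection
is right that = samples the updated permissions):

   $ touch f; chmod 644 f
   $ chmod o+x=o f
   $ stat -c %a f
   645

If = sampled the old o permissions instead, the result would be 644.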





Re: chmod: man page clarification for letter X

2020-10-08 Thread Kaz Kylheku (Coreutils)

On 2020-10-08 10:28, Tomás Fernandes wrote:

Hello,

I've recently come across chmod's man page as someone who is not very
experienced (1st year CS undergrad), and found the definition of the
letter X in the man page a bit unclear, more specifically this part (in
bold):

execute/search only if the file is a directory or already has execute
permission *for some user* (X)

In my opinion this could be worded slightly better and more clearly as:

execute/search only if the file is a directory or already has execute
permission *for user, group or other* (X)


How about "X behaves like x if the object already has any execute
permission bit set, otherwise it has no effect."




Re: wc feature request

2020-10-06 Thread Kaz Kylheku (Coreutils)

On 2020-10-05 08:40, A B wrote:
Many thanks for all the much needed contributions to society as a whole.


I did have one feature to request for wc, which I think would be
highly complementary to grep’s -q flag. It would be really cool if wc
could have a -q flag as well, which could return matches within a
predefined threshold as the exit code itself. So for example, if I
wrote ‘wc -l -q’ at the end of a pipe, then no output would be
returned, but the exit code would return a 3 if three lines were
found.


I don't see this exact feature in the documentation of GNU grep.
grep terminates with a 0 status (success) when matches are found,
and this is true with -q.

The idea has limited applicability; there are only as few as 8 bits
(or fewer?) available in the process status word for encoding the
exit code.

It could be useful for counting the number of lines or characters
in files that are somehow guaranteed to be small.
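
For what it's worth, the behavior can be had today with a subshell,
which also demonstrates the 8-bit truncation:

   $ printf 'a\nb\nc\n' > 3lines
   $ ( exit "$(wc -l < 3lines)" ); echo $?
   3
   $ ( exit 300 ); echo $?
   44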

The inversion of the exit success polarity is also troubling.
If nothing is counted, that's 0 (success), whereas if anything
is counted, that is a termination failure.




Re: What is the interpretation of bs of dd in terms of predicting the disk performance of other I/O bound programs?

2020-09-28 Thread Kaz Kylheku (Coreutils)

On 2020-09-23 09:56, Peng Yu wrote:

Hi,

Many people use dd to test disk performance. There is a key option of dd,
bs, which I understand what it literally means. But it is not clear how
the performance measured by dd using a specific bs maps to the disk
performance of other I/O bound programs. Could anybody let me know
the interpretation of bs in terms of predicting the performance of
other I/O bound programs? Thanks.


The bs likely maps to performance like this:

perf (fraction of max)

 1.0 |        _____________
     |      _/             \_
     |    _/
     |   /
     |  |
     | /
   0 |/
     +----|---------------|----
     bs   A               B


A bs of zero is impossible, so we can call that point "no performance".
Ridiculously small values of bs will cause the program to be doing
too many system calls. The larger the bs, the fewer syscalls dd has
to make, so there is some improvement with diminishing returns until
the maximum theoretical performance is reached for that OS, hardware
and approach (read/write loop).  Then if bs gets ridiculously large,
so that the buffers don't fit into the on-chip CPU caches, then
there are almost certainly negative returns.

The range of sizes from A to B is probably wide enough, that an
intelligent guess at a good bs size is likely to land in it.
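
To locate that range empirically on a given system, a rough sketch
(each bs:count pair totals 1 GiB; /dev/zero to /dev/null isolates
syscall overhead from any real device):

   for pair in 512:2097152 4096:262144 65536:16384 1048576:1024; do
       bs=${pair%:*} count=${pair#*:}
       printf 'bs=%-8s ' "$bs"
       dd if=/dev/zero of=/dev/null bs="$bs" count="$count" 2>&1 | tail -n 1
   done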



Re: date: unclosed date string comments

2020-08-06 Thread Kaz Kylheku (Coreutils)

On 2020-08-05 18:52, sunnycemet...@gmail.com wrote:

Hello.  Given this documentation:

Comments may be introduced between round parentheses, as long as
included parentheses are properly nested.


Is this considered a bug:


■ LC_ALL=C date -d '(test 1 2 3'
Wed Aug 5 00:00:00 EDT 2020
■ LC_ALL=C date -d '((test 1 2 3)'
Wed Aug 5 00:00:00 EDT 2020


Thank you.


This is a lack of diagnosis that adds up to a de facto feature:
a comment which does not close is closed implicitly by
the end of the string.

Once this kind of thing escapes into the wild, the safest thing
is to document it. A fix for this (like making date exit with
a diagnostic and failed termination status) will break something
for someone somewhere.

Unix is historically awful for this sort of looseness. For
instance an extra closing parenthesis in an extended regex is
treated as literal. Too late to fix, this had to be enshrined
in POSIX scripture:

   The <right-parenthesis> shall be special when matched with
   a preceding <left-parenthesis>, both outside a bracket
   expression.

Particularly in the earlier history of Unix, a lot of it
was geared toward getting the happy cases working with the
least amount of code.

At least the C compilers grew up quite a bit. You can fix
looseness in compilers more easily because when you tighten
something in a compiler, the diagnostic blows up in the face
of a developer.

If you tighten something in a utility, something breaks in
the field, because it is a run-time check.



Re: Enhancement Request for sha256sum - output only the SHA-256 hash alone

2020-07-19 Thread Kaz Kylheku (Coreutils)

On 2020-07-17 14:33, Pádraig Brady wrote:

On 17/07/2020 15:21, jens.archlinux jens wrote:

Hello,

propose to add a new option for sha256sum to output only the SHA-256 hash
alone, without a trailing file name and without a trailing newline.

(This would make sense for one input file only).

It would make shell scripts that use sha256sum much simpler. Currently it
is necessary to split the output of sha256sum to obtain the hash, which
usually requires an additional command / Unix process.


This is one of those trade-offs.
I'd be 60:40 against adding such an option,
because it's so easy to implement with cut(1):
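
(Presumably something along these lines; the trailing newline still
remains, of course:)

   sha256sum file | cut -d' ' -f1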


Can I muse about checksum utility design?

Someone once, who didn't understand Unix utility design principles,
had the dumb idea of polluting the output of a checksumming utility
with extraneous information. Somehow that became a meme for authors
of new checksumming utilities, though not so rigid a meme that they
would actually make those outputs compatible with their predecessors.

Maybe it was the same person who thought it's a good idea for "dd"
to output, by default, cruft like:

  0+0 records in
  0+0 records out
  0 bytes (0 B) copied, 0.726321 s, 0.0 kB/s

Did that person ever work at Microsoft on MS-DOS? It's suspiciously
reminiscent of:

  C:\Users\kaz>copy foo.txt bar.txt
  1 file(s) copied.

(Thank you; I would never be able to guess that one file was
copied from the fact that I specified one file, and the command's
termination status was successful).

I'm against adding the option for this reason: the default
behavior of a checksum function should be to output nothing but
the checksum.

Note that the word "sum" is redundant in "sha256sum".

Thus there is an opportunity for a "sha256" utility which just outputs
nothing but the sum. That utility could be sha256sum itself,
upon detecting that argv[0] ends in "sha256", though that is risqué.

Also, that utility should perhaps calculate a continued sum when
given multiple arguments, and not individual sums. So that is to say:

   sha256 a b c
   sha256 <(cat a b c)

should be the same.

Now let's talk options. It should have two, -i and -f:

   sha256 -i <state> [ inputs ... ]

would calculate the hashes over the inputs, starting with the
specified state. The special <state> token of 0 (the ASCII
zero digit) would mean "the initial state". In the -i mode,
sha256 would output a string (in an unspecified, opaque format,
perhaps inspired by "stty -g") which encodes the newly updated
state. The string should have no quoting or escaping issues
for shell programming.

The output of sha256 -i would be suitable as an argument to
the -i option of a new command, to continue the hashing operation
over additional inputs. It would also be suitable as an argument
to -f, so that:

   sha256 -f <state> [ inputs ... ]

would process inputs (if any) just like sha256 -i <state>, and then
do the hash finalization, and output not another state cookie, but
the final hash.

Thus, the output of

   sha256 a b c

could also be obtained using:

   st=$(sha256 -i 0)
   st=$(sha256 -i $st a)
   st=$(sha256 -i $st b)
   st=$(sha256 -i $st c)
   sha256 -f $st

or:

   st=$(sha256 -i 0 a b c)
   sha256 -f $st

or:

   st=$(sha256 -i 0)
   sha256 -f $st a b c

or, "point-free" application:

   sha256 -f $(sha256 -i 0 a b c)

etc.

I would add one more option: -s (literal string, not file name).

Whenever one or more -s options are present, their argument values
are pulled into the hash, in the order they appear, before any
files. Thus:

   $ sha256 -s coreutils
   3993c379c029014a9c4b2adf5d23397b3c7421467a0cb3575ff925bb6f6329b0

   $ sha256 -s core -s utils
   3993c379c029014a9c4b2adf5d23397b3c7421467a0cb3575ff925bb6f6329b0

   $ sha256 -f $(sha256 -i 0 -s core -s utils)
   3993c379c029014a9c4b2adf5d23397b3c7421467a0cb3575ff925bb6f6329b0


-i and -f are mutually exclusive, and must precede any -s options.





Re: mv w/mkdir -p of destination

2020-07-03 Thread Kaz Kylheku (Coreutils)

On 2020-07-03 14:38, Bernhard Voelker wrote:

On 2020-05-11 05:16, Vito Caputo wrote:

Does this already exist?

Was just moving a .tgz into a deep path and realized I hadn't created
it on that host, and lamented not knowing what convenient flag I could
toss on the end of the typed command to make `mv` do the mkdir -p
first for me.

I was surprised to not see it in mv --help or mv(1) when I checked for
next time...


mv(1) is ... well, for moving files and directories.


If we consider a filesystem to be a collection containing a name space
which assigns path names to objects, then mv is a tool for re-assigning
a new path name to an object.

Years ago I implemented this concept in a version control system.
It has a mv command which works regardless of whether directories
exist.

(In fact instead of using the mv command, you can edit the representation
of the directory structure, and then run an update command to re-shape
the workspace accordingly; mv works by doing the same thing.)

However, the tool did not store a representation of directories at all.
Just files and symbolic links.

If a directory-restructuring operation renames all the files out of
a directory, that directory is removed (unless it contains
local, untracked content). Its parent is removed if it becomes empty
and so on.

It works beautifully and is intuitive to use.


And creating the destination directory in the same go seems to be a
seldom-needed case.  We see also from the missing answers so far that
nobody seems to be much excited about this feature.

Anyway, as it's very easy to work around it with a separate mkdir(1)


Not to mention rmdir! If you have moved all content out of a directory,
you may not want it to exist. For symmetry "mv -p" should remove all
empty directories left behind, as far up the tree as possible and
as permissions will allow.
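
A rough sketch of that symmetric behavior in shell (error handling
mostly elided; rmdir -p prunes empty parent directories):

   mv_p()
   {
       mkdir -p -- "$(dirname -- "$2")" && mv -- "$1" "$2" || return
       rmdir -p -- "$(dirname -- "$1")" 2>/dev/null || :
   }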


it's better to avoid adding complexity to the code.


Of course, the complexity doesn't go away; i.e. it stays with the
user to grapple with. Though forty years of Unix users don't
seem to have minded all the various inconveniences, so why bother.



Re: Disable b2sum from coreutils?

2020-07-02 Thread Kaz Kylheku (Coreutils)

On 2020-07-01 22:18, Jeffrey Walton wrote:

Hi Everyone,

The BLAKE2 folks have optimized implementations for b2sum on i686,
x86_64, NEON and PowerPC. It also has more options than the coreutils
version.

I'd like to disable b2sum in coreutils and use the BLAKE2 team's 
version.


This is a job for your open source system distribution.

The approach taken on some distros is to build every package normally.
If two or more packages provide the same executable, there is a mechanism
in place to choose which one is installed.

It may be that all are installed, but under an altered name like say
"/usr/bin/b2sum.coreutils" and "/usr/bin/b2sum.blake2". Then the
resolution system chooses one of these as the target of a /usr/bin/b2sum
symbolic link.
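
On Debian-family systems, for instance, that resolution mechanism is
update-alternatives (the priorities here are arbitrary):

   update-alternatives --install /usr/bin/b2sum b2sum /usr/bin/b2sum.blake2 50
   update-alternatives --install /usr/bin/b2sum b2sum /usr/bin/b2sum.coreutils 40
   update-alternatives --config b2sum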

The renaming and symlinking are done outside of the build systems of the
programs; they are arranged by the distro build system.

The distro build system redirects the "make install" of a package into
a temporary install directory which is for that package only. Then the
installed materials are further manipulated. For instance, the materials
may be split into development, run-time and documentation parts, which
become separate packages. It is at this stage that clashing executables
might be renamed.

The virtue of this system is that the end user has a way to choose which
binary dominates, without the upstream packages having to be rebuilt;
all the packages have all the binaries.

I think pretty much any major distro has a way to do this; find out how
yours is doing it.

If you're building your own local package from sources (like say blake2)
and would like its b2sum to be used instead of the one in /usr/bin,
then simply make sure that /usr/local/bin is ahead of /usr/bin in
your PATH.



Re: feature request: better intuitive syntax LINK=TARGET

2020-06-25 Thread Kaz Kylheku (Coreutils)

On 2020-06-24 19:35, Andrej Surkov wrote:

Hi all!

   ln syntax is very uncertain - every time I use ln I'm confused what is
   correct: "ln -s LINK TARGET" or "ln -s TARGET LINK"! Of course I can
   check man ln or ln --help, but what if we add unambiguous syntax, for
   example

   ln -s LINK=TARGET


mv existing new
cp existing new
ln existing new
ln -s rel-or-abs-path new

The confusing thing in the ln -s case is that if the path is relative, it
is resolved with respect to the directory of new, not the current
directory where the command is executing.

I suspect that this altered semantics of the source argument is actually
the root cause of people becoming confused about the order of the
arguments. Since the rel-or-abs-path often isn't a working path from the
current directory to the desired link target, but simply specifies the
content of the link, your brain thinks of the operation as a variable
assignment: stuffing the specified link with the given literal value.

I think this is what is fixed by the -r option of the GNU Coreutils ln.
If you use -r, then it's just:

ln -sr orig new   # like plain ln, cp or mv

With -r, if orig is relative, it is understood relative to the current
working directory, not to the link's directory. If the object exists,
then orig is the actual path from here to that object.

The orig and new paths are canonicalized, and then the relative path
R from $(dirname new-canon) to orig-canon is calculated. Then the
link is created as if ln -s R new.
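
A quick demonstration (paths invented for the example):

   $ cd /tmp
   $ mkdir sub; touch target.txt
   $ ln -sr target.txt sub/link.txt
   $ readlink sub/link.txt
   ../target.txt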

I suspect if you start using "ln -sr original-item link-to-it",
and no longer think of the operation as stuffing a literally specified
piece of content into the variable-like link object, but as creating
a virtual copy of original-item named link-to-it, the recurrent
confusion may be cured.





Re: [PATCH] md5sum: add an option to change directory

2020-05-20 Thread Kaz Kylheku (Coreutils)

On 2020-05-20 14:15, Bertrand Jacquin wrote:

In the fashion of make and git, add the ability for all sum tools to
change directory before reading a file.

  $ sha256sum /etc/fstab
  b5d6c0e5e6bc419b134478ad7b3e7c8cc628049876a7772cea469e81e4b0e0e5  /etc/fstab


Make requires this option because it reads a Makefile, and those often
contain relative references which assume that the Makefile's directory
is current.

The inputs to md5sum don't contain path references that break.

In other regards, every tool that does anything with files could
have a -C option:

Copying files:

cp -C /etc fstab fstab.bak

Executing a script:

sh -C /etc rc.local

Editing:

vi -C /etc fstab

Where does it end?


  $ sha256sum -C /etc fstab
  b5d6c0e5e6bc419b134478ad7b3e7c8cc628049876a7772cea469e81e4b0e0e5  fstab


The net effect is that just the output has changed to omit the path name.

Maybe this wants to be a --strip or -p option like with diff or patch,
or --basename-only to strip a variable number of components, leaving only
the last.

If I want to print a simplified name, I don't want to do this:

   md5_short()
   {
       local dir=$(dirname "$1")
       local base=$(basename "$1")
       md5sum -C "$dir" "$base"
   }

   md5_short /path/to/whatever

I just want this:

   md5sum --basename /path/to/whatever

The -C functionality can easily be done with subshells, or with a chdir()
after fork(), before exec():

In a script, instead of "make -C dir", you can always do (cd dir; exec make).


In C, make yourself a spawning function that has the dir-changing
functionality built in:

   spawn_program(path_to_executable, /* or let PATH be searched */
 change_to_this_directory,
 use_these_args,
 these_env_vars);




Re: suggestion: /etc/dd.conf

2020-04-29 Thread Kaz Kylheku (Coreutils)

On 2020-04-28 02:14, turgut kalfaoğlu wrote:

I would like to suggest and in fact volunteer to create a conf file
option to 'dd'.


By doing that you're replacing function arguments with global variables,
which is a bad idea.



It has dozens of hard to remember options, and there are some that I
would like to use all the time.


Look into shell functions and aliases.
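
For example, something like this in ~/.bashrc (ddusb is a made-up name):

   ddusb() { sudo dd if="$1" of="$2" bs=4M conv=fsync status=progress; }

   # usage: ddusb CentOS-6.10-x86_64-LiveDVD.iso /dev/sdc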


For example, I am currently doing:

$ sudo dd if=CentOS-6.10-x86_64-LiveDVD.iso of=/dev/sdc bs=4096 conv=fsync


right now, and I have to lookup the conv=fsync option every time I
want to write to a USB drive.


It's unlikely that this option is required; have you tried it without?

To make sure any buffered writes are flushed, do a "sync" after the
entire dd operation.





Re: statically linking coreutils 8.32

2020-03-19 Thread Kaz Kylheku (Coreutils)

On 2020-03-19 01:54, Gabor Z. Papp wrote:

lo lo,

while trying to statically link coreutils 8.32 on linux x86_64, I'm
getting the following error:


Static linking has not been supported by Glibc for many years now; so
you can at best get a program's own components to be static, but not a
fully static executable with linked-in libc.


For static executables, you have to use a C library that supports
static linking: musl or whatever.





Re: failing CI jobs

2020-03-18 Thread Kaz Kylheku (Coreutils)
On 2020-03-18 04:27, "Toni Uhlig (Smartphone)" via GNU coreutils General 
Discussion wrote:

There are a lot of failing CI jobs and nobody seems to care about them.
Some of them seem to have been failing since two+ years ago.
Why not disable them, if nobody cares?

Source:
https://hydra.nixos.org/job/gnu/coreutils-master


Firstly, this specific page is not found; 404 error.

Secondly, more generally, surely NixOS is not hosting CI for GNU
Coreutils development?





Re: altchars for base64

2020-03-15 Thread Kaz Kylheku (Coreutils)

On 2020-03-15 09:00, Assaf Gordon wrote:

Hello,

On 2020-03-15 12:12 a.m., Kaz Kylheku (Coreutils) wrote:

On 2020-03-14 22:20, Peng Yu wrote:

Python base64 decoder has the altchars option.

[...]

But I don't see such an option in coreutils' base64. Can this option
be added? Thanks.


# use %* instead of +/:
base64 whatever | tr '+/' '%*'


The reason for alternative characters is typically to use them in URLs,
where "/" and "+" are problematic.

A new command "basenc" was introduced in coreutils version 8.31
(released last year) which supports multiple encodings.


If your script has to work in installations that aren't up to
Coreutils 8.31, or don't use Coreutils at all (base64 comes from
somewhere else), you need the tr trick or its ilk.




Re: altchars for base64

2020-03-14 Thread Kaz Kylheku (Coreutils)

On 2020-03-14 22:20, Peng Yu wrote:

Hi,

Python base64 decoder has the altchars option.

https://docs.python.org/3/library/base64.html
base64.b64decode(s, altchars=None, validate=False)¶

But I don't see such an option in coreutils' base64. Can this option
be added? Thanks.


# use %* instead of +/:

base64 whatever | tr '+/' '%*'
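
(And the decoding direction of the same trick:)

   tr '%*' '+/' | base64 -d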



Re: RFC: du reports a 1.2PB file on a 1TB btrfs disk

2020-03-11 Thread Kaz Kylheku (Coreutils)

On 2020-03-10 21:31, Jim Meyering wrote:

On Tue, Mar 10, 2020 at 12:24 PM Kaz Kylheku (Coreutils)
<962-396-1...@kylheku.com> wrote:

On 2020-03-10 11:52, Jim Meyering wrote:
> Otherwise, du provides no way of seeing how much of the actual disk
> space is being used by such FS-compressed files.

If you stat the file, what are the values of st_size, st_blksize and
st_blocks?


That particular file is long gone, but I've just created a 1.8T file
on a 700G file system.
Before I began this experiment, "Avail" was 524G, so it appears to
occupy about 60G actual space.


Sorry; forget I mentioned st_blksize; I forgot that st_blocks is
measured in 512 byte blocks regardless of st_blksize.

FTR, I created the file by running this: yes $(printf '%065535d\n' 0) > big


$ stat big
  File: big
  Size: 1957123607586   Blocks: 3822507048   IO Block: 4096   regular file


So here, the Blocks value (coming from st_blocks) doesn't inform us
differently from size; if we multiply it by 512, it matches the size
exactly.

The underlying FS can use the st_blocks value to indicate the actual
storage. For instance, if I do this on ext4:

   # dd of=file seek=$((1024 * 1024)) count=1 if=/dev/zero

Then:

   # du -h file
   12K file
   # du --apparent-size -h file
   513M    file

The default size (12K) comes from the st_blocks information in the stat
structure, whereas the apparent size comes from st_size:

  # stat file
    File: `file'
    Size: 536871424   Blocks: 24   IO Block: 4096   regular file
  Device: 902h/2306d  Inode: 1624448  Links: 1
  Access: (0644/-rw-r--r--)  Uid: (0/root)  Gid: (0/root)
  Access: 2020-03-11 04:22:26.0 -0700
  Modify: 2020-03-11 04:22:26.0 -0700
  Change: 2020-03-11 04:22:26.0 -0700

The issue you are seeing here is that btrfs should probably be
publishing a st_blocks value that matches the actual storage,
accounting for sparseness and compression, and not just a repetition
of the size, rounded up to a block and quoted in 512 byte units.

The fidelity of the du output is only as good as what is in stat.




Re: RFC: du reports a 1.2PB file on a 1TB btrfs disk

2020-03-10 Thread Kaz Kylheku (Coreutils)

On 2020-03-10 11:52, Jim Meyering wrote:

Otherwise, du provides no way of seeing how much of the actual disk
space is being used by such FS-compressed files.


If you stat the file, what are the values of st_size, st_blksize and
st_blocks?




Re: ls feature request

2020-02-21 Thread Kaz Kylheku (Coreutils)

On 2020-02-21 10:32, Riccardo Mazzarini wrote:
Hi Kaz, this works almost perfectly but it fails with filenames that
contain spaces.

I tried using quotation marks, i.e.

ls -dU "$(find .* * -maxdepth 0 -not -type d | sort ; find .* * -maxdepth 0 -type d | sort)"

but that didn't work. Any ideas?


I can answer that in three parts of increasing complexity. The remaining
caveat is that since we are relying on passing all names as arguments to
a single invocation of "ls", these solutions are all susceptible to the
kernel's argument passing limit.


Part 1:

Solutions involving capturing the output of a program and interpolating
it as arguments for ls will not work. Or if they are made to work, they
will require a clumsy escaping-and-eval job. So we switch to another
method.

If the only issues with names are spaces and control characters, but no
spurious newlines, so that the output of "find" has exactly one name per
line, then we can use xargs:

  (find .* * -maxdepth 0 -not -type d | sort ; find .* * -maxdepth 0 -type d | sort) | xargs ls -dU


Note that xargs cannot use your shell alias for ls. If you want colors,
you have to add --color=auto.

Part 2:

If the names can be completely arbitrary strings, and include newlines,
then we have "find -print0" that will output names as null terminated
strings, and we have "xargs -0" that reads null-terminated strings.

What we don't have is a "sort" that does null-terminated string I/O.

But, what we do have is GNU Awk. GNU Awk can separate input according to
arbitrary records, using a regular expression.  In GNU Awk's regular
expression syntax, we can specify the null byte as \0.

Watch this. Here is a little test directory with some files:

  ~/test $ ls
  cert.pem  char.c  hello.c  Makefile  palin.tl  str.sh
  char  hello   lex.awk  notreached.c  pushl.s

We can pass these as null-terminated strings with "find -print0" to a
gawk script which handles them just fine and prints them as
newline-terminated lines:

  $ find . -print0 | gawk -v 'RS=\0' 1
  .
  ./hello
  ./Makefile
  ./lex.awk
  ./palin.tl
  ./char
  ./char.c
  ./hello.c
  ./cert.pem
  ./notreached.c
  ./pushl.s
  ./str.sh

And with -v 'ORS=\0', it will output null terminated records too! But we
won't be making use of this.

With the above, we can implement a sort easily:

# Null terminated string sort using GNU Awk
gawk -v 'RS=\0' '{ line[NR] = $0 } END { n = asort(line); for (i = 1; i <= n; i++) printf("%s\0", line[i]); }'


It's quite a mouthful, so let's move the RS assignment into a BEGIN block
and put the whole awk script into a variable called sort0:

sort0='BEGIN { RS = "\0" } { line[NR] = $0 } END { n = asort(line); for (i = 1; i <= n; i++) printf("%s\0", line[i]); }'


With that variable, we can now have:

   (find .* * -maxdepth 0 -not -type d -print0 | gawk "$sort0" ; find .* * -maxdepth 0 -type d -print0 | gawk "$sort0" ) | xargs -0 ls -dU



Part 3:

Since we're using Gawk, we could run a single "find" job and use logic
inside the Gawk script to do the separation of directories and
non-directories. To distinguish the two, we can use GNU find's -printf
instead of -print0.  We can print directory names with a "d" prefix,
and other entries with a "-" prefix.

My attempt at this script looks like this:

#!/bin/bash

(find .* * -maxdepth 0 \
   \( -not -type d -printf "-%p\0" \) -o \
   \( -type d -printf "d%p\0" \) ) | \
gawk 'BEGIN { RS = "\0" }
      /^-/ { nondir[NR] = substr($0, 2) }
      /^d/ { dir[NR] = substr($0, 2) }
      END  { n = asort(nondir)
             m = asort(dir)
             for (i = 1; i <= n; i++)
               printf("%s\0", nondir[i])
             for (i = 1; i <= m; i++)
               printf("%s\0", dir[i]) }' | \
xargs -0 ls -dU --color=auto


As we want, the script handles the case when I have a file created using:

  $ touch 'foo
  bar'

It ends up displayed as 'foo'$'\n''bar', indicating that it got passed
correctly through the plumbing all the way to the final ls -dU.




Re: ls feature request

2020-02-20 Thread Kaz Kylheku (Coreutils)

On 2020-02-20 16:01, Riccardo Mazzarini wrote:
The ls program currently provides a "--group-directories-first" option,
to group directories before files. It'd be nice to have the opposite
option, "--group-directories-last" or "--group-files-first", to group
files before directories.


Workaround applicable in "ls -l" case:

ls -l | awk '/^d/ { dirs = dirs $0 "\n"; next } 1 ; END { printf("%s", dirs); }'






Re: BUG in sort --numeric-sort --unique

2020-02-13 Thread Kaz Kylheku (Coreutils)

On 2020-02-13 14:00, Stefano Pederzani wrote:

In fact, separating the parameters:
# cat controllareARCHIVIO_2020/02/controllare20200213.txt | sort -u | sort -n | wc -l
1262
we workaround the bug.


My own experiment confirms that things are reasonable.

When -n and -u are combined, then uniqueness is based on numeric
equivalence. Since numeric equivalence is weaker, de-duplication
based on numeric equivalence can cull out more records than
de-duplication based on textual equivalence.

$ printf "0\n00\n000\n" | sort -u
0
00
000
$ printf "0\n00\n000\n" | sort -n
0
00
000
$ printf "0\n00\n000\n" | sort -nu
0
$ printf "0\n00\n000\n" | sort -n | sort -u
0
00
000
$ printf "0\n00\n000\n" | sort -u | sort -n
0
00
000

As you can see, sort -nu is not equivalent to any combination
of sort -n and sort -u.   sort -nu has de-duplicated a file of
different "spellings" of zero down to a single entry.

sort -u may not de-duplicate these entries because "0"
is textually different from "00".


Every line is only something like "1.2.3.4".


Unfortunately, "sort -n" will probably not do what you think with
this data.

Please read sort's GNU Info documentation; the man page lacks
detail about what numeric sorting means.

Also, the POSIX standard's description of -n:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

In short, what -n does is recognize a *prefix* of each line as a number
according to a pattern that includes optional blanks, an optional sign,
digits, a radix character, and digit group separators.

-n does not deal with compound numeric identifiers like 1.2.3.4.

Basically 1.2.3.4 and 1.2.4.4 both look like the number 1.2.

$ sort -nu
1.2.3.4
1.2.4.4
1.2.5.6
[Ctrl-D][Enter]
1.2.3.4

Oops! This result is correct; under numeric sort (-n), all these lines
are considered to have the key 1.2.  And if we de-duplicate based on
that, they are all considered to be duplicates; they de-duplicate down
to a single line.
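
What this data probably wants instead is per-component numeric keys,
or version sort:

   sort -t. -k1,1n -k2,2n -k3,3n -k4,4n file   # assumes four components
   sort -V file                                # GNU extension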







Re: gcc10's -Wreturn-local-addr gives FP warning about lib/careadlinkat

2020-02-06 Thread Kaz Kylheku (Coreutils)

On 2020-02-06 09:05, Jim Meyering wrote:

On Thu, Feb 6, 2020 at 6:03 AM Pádraig Brady  wrote:

On 06/02/2020 00:27, Jim Meyering wrote:
> Building latest latest coreutils using latest-from-git gcc10 evokes
> this false positive:
>
> lib/careadlinkat.c: In function 'careadlinkat':
> cc1: error: function may return address of local variable
> [-Werror=return-local-addr]
> lib/careadlinkat.c:73:8: note: declared here
> 73 |   char stack_buf[1024];
>
> I'm guessing improved flow analysis will eventually suppress this. I
> hesitate to turn off the useful and normally-high-S/N
> -Wreturn-local-addr globally. Maybe just disable it in that one file,
> temporarily?

The logic of the function looks fine.
Would an `assure (buf != stack_buf)` before the `return buf`
indicate that constraint to gcc with minimal runtime overhead?


I would have preferred that, but it has no effect.
I then tried to suppress that warning in the affected file by adding
these lines:

/* Without this pragma, gcc 10.0.1 20200205 reports that
   the "function may return address of local variable".  */
# pragma GCC diagnostic ignored "-Wreturn-local-addr"

But, surprisingly, that didn't help, either.
Also tried Kaz Kylheku's return-early suggestion, to no avail.


I have other thoughts about this.

There is a well-known technique of using a stack array for small
arrays up to a certain size, after which a dynamic array is used.

That technique is useful in cases when dynamic allocation can be
avoided entirely.

If the array has to eventually go into
dynamic storage, because it is returned, then it can just
start out that way.

So that is to say, this justification for the stack_buf is
pretty poor:

   /* Allocate the initial buffer on the stack.  This way, in the
  common case of a symlink of small size, we get away with a
  single small malloc() instead of a big malloc() followed by a
  shrinking realloc().  */

The common case is in fact small symlinks; I can't remember
when I've seen a symlink that was anywhere near a kilobyte long.

If you start with, say, a 128 byte malloc buffer, or even a 64
byte one, there is hardly any need to realloc that to a smaller size,
and doing so for chunks of that size might not even actually
make any memory available, depending on the allocator.

The vast majority of symlinks you will ever read will fit into
128 bytes.

Also think about this: depending on the exact filesystem,
small symlinks are stored directly in the inode (or perhaps
even the directory entry?), whereas large symlinks have to
go to a separate block.

So, okay, there is an overhead *inside* readlink for fetching
a large symlink, and that overhead dwarfs the user-space
concern of whether an extra realloc is called.

readlink may have to read a whole extra data block of storage
containing the symlink, on a cache-cold system. That could
result in a disk seek, bloating up the time into a range
measured in milliseconds.

Basically, the initial guesstimate of the required space for
a symlink should probably be more or less aligned with a
reasonable estimate of the symlink size that is efficiently
handled at the filesystem level.




Re: gcc10's -Wreturn-local-addr gives FP warning about lib/careadlinkat

2020-02-06 Thread Kaz Kylheku (Coreutils)

On 2020-02-05 16:27, Jim Meyering wrote:

Building latest latest coreutils using latest-from-git gcc10 evokes
this false positive:

lib/careadlinkat.c: In function 'careadlinkat':
cc1: error: function may return address of local variable
[-Werror=return-local-addr]
lib/careadlinkat.c:73:8: note: declared here
   73 |   char stack_buf[1024];

I'm guessing improved flow analysis will eventually suppress this.


By chance, does this make it go away (my changes are in the #else parts
of the #ifdefs)?


  if (link_size < buf_size)
    {
      buf[link_size++] = '\0';

      if (buf == stack_buf)
        {
          char *b = (char *) alloc->allocate (link_size);
          buf_size = link_size;
          if (! b)
            break;
          memcpy (b, buf, link_size);
#ifdef OLD
          buf = b;
#else
          return b;
#endif
        }
      else if (link_size < buf_size && buf != buffer && alloc->reallocate)
        {
          /* Shrink BUF before returning it.  */
          char *b = (char *) alloc->reallocate (buf, link_size);
#ifdef OLD
          if (b)
            buf = b;
#else
          if (b)
            return b;
#endif
        }

      return buf;
    }





Re: What is the difference between unlink and rm -f?

2020-01-29 Thread Kaz Kylheku (Coreutils)

On 2020-01-29 01:45, Peng Yu wrote:

Hi,

It seems to me unlink and rm -f are the same if the goal is to delete
files. When are they different? Thanks.


I answered this on Unix Stackexchange in 2016:

https://unix.stackexchange.com/a/326711/16369

:)



Re: Is it safe to replace dd?

2020-01-20 Thread Kaz Kylheku (Coreutils)

On 2020-01-20 04:14, microsoft gaofei wrote:

Many people suggest using dd to create bootable USB media,
https://www.archlinux.org/download/ . But cp and cat also write to
USB, e.g., cp archlinux-2020.01.01-x86_64.iso /dev/sdb, or cat
archlinux-2020.01.01-x86_64.iso > /dev/sdb. Is it safe to use these
commands instead of dd? If it's unsafe, I want to know the reason.


dd was required on ancient Unix systems for dealing with "raw" devices
that had mandatory block sizes.

For instance, if a raw device such as a hard disk or tape drive, had
a block size of 512, then writing to it required a sequence of correctly
sized write system calls. If the program wrote more than 512 bytes, the
device driver would truncate the write to 512. If the program wrote fewer
than 512 bytes, then it wouldn't completely overwrite the block, yet
the position would advance to the next block. Maybe garbage would
be left in the partial block, or zeros.

With reads there would be a similar problem. A 256 byte read on
a raw device with a 512 block size would result in a truncated
read (very reminiscent of a truncated UDP datagram receive).

The dd program's block size feature would ensure that reads and writes
involving raw devices were performed correctly. With dd you can
read from a raw device with 256 byte blocks, and output to a device
with 1024 byte blocks, an operation called "re-blocking".

The block devices you're working with in a GNU/Linux system aren't
raw. You can write to them in whatever request sizes you want.
The aggregation into correct transfer units is done by the block
driver software inside the kernel.

There is a small advantage in writing a multiple of the block
size. For instance, suppose we write to a block device like /dev/sda1
one byte at a time. Each time we write a byte, an entire block is
edited in-memory to change that byte, and then the entire block is
flushed out to the device, usually asynchronously. By writing bytes,
we risk reduced performance: that the same block of the device will
be wastefully dirtied and flushed two or more times.

However, it's very unlikely that the buffer sizes used by standard
utilities like "cp" fail to be good multiples of the block size.
Block sizes are almost always powers of two, and the buffers in file
copying utilities are too, and larger than typical block sizes.

dd has features that are not found in other utilities, such as
seeking into arbitrary positions in the source and destination and
copying only certain amounts.

dd can also work with devices that are infinite sources of bytes;
with dd you can read 1024 bytes from /dev/urandom, which can't
be done with cat or cp.

If you need to do any of these things, you need dd, or something
like it.
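
For instance (the file names here are made up for illustration):

   # take exactly 1024 bytes from an endless source of bytes
   dd if=/dev/urandom of=key.bin bs=1024 count=1

   # copy 1 MiB from 4 MiB into the input to 8 MiB into the output,
   # without truncating the output file
   dd if=disk.img of=backup.img bs=1M skip=4 seek=8 count=1 conv=notrunc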




Re: Regarding compilation of coreutils.

2020-01-06 Thread Kaz Kylheku (Coreutils)

On 2020-01-06 11:53, Sandeep Kumar Sah wrote:
previously i edited ls.c to print "Hello World" before listing content
in a directory.
Now i have deleted the coreutils folder and everything underneath it.
I want to get the original version of ls command for which i am unable
to build the source file, it tells me that
"checking for a BSD-compatible install... /usr/local/bin/install -c
checking whether build environment is sane... configure: error: ls -t
appears to fail.  Make sure there is not a broken
  alias in your environment
configure: error: newly created file is older than distributed files!


Did you install this modified ls into your /bin?

Or is it in some non-system location that happens to be listed in your
PATH?

If you didn't clobber your system ls, then this is just a PATH issue:
either edit PATH, or find out where this modified ls is and
remove/rename it.

If you clobbered your /bin/ls, you may be able to use your GNU/Linux
distro's packaging system to refresh the installation.

Assuming you added something like:

  printf("Hello, World\n");

to the code, then you can edit /bin/ls with a binary editor, such as,
oh, "vim -b /bin/ls".  Find the "Hello World" string, and overwrite
the "H" with a null byte to reduce it to zero length. Save the
executable and try it. If it's something like

  puts("Hello, World");

where the newline is implicit in the function behavior, you may have
to find the instructions which make this call and overwrite them with
NOP (byte value 0x90 on Intel x86, IIRC).

Other ideas/hacks:

- Copy a working /bin/ls from another system that is identical or
  similar to yours.  E.g. say you're on 64 bit Ubuntu 18. If you happen
  to have 64 bit Ubuntu 16, that system's /bin/ls should work.

- Go into the Coreutils configure system and try to
  defeat the test for a working "ls -t". Maybe the result of the test
  is not needed for the sake of building a working ls.

- Rename the funny ls binary to ls-funny, and write a /bin/ls shell
  script wrapper which calls ls-funny "$@", and filters out the
  "Hello, World" first line of output, as in something like:

#!/bin/sh
/bin/ls-funny "$@" | sed -n -e '2,$p'

- Absolute last resort of the utter coward: Boot some rescue DVD-ROM.
  Mount your install partition and copy the live system's /bin/ls into
  your install partition's /bin/ls.







Re: [PATCH] sleep: allow ms suffix for milliseconds

2019-12-09 Thread Kaz Kylheku (Coreutils)

On 2019-12-08 21:46, sunnycemet...@gmail.com wrote:

On 2019-12-02 13:58, Stephane Chazelas wrote:

With GNU coreutils sleep (and ksh93's builtin but not that of
bash or mksh) one can add an e-3 suffix to get milliseconds (and
e-6 for us and e-9 for ns)

sleep 1 # s
sleep 1000e-3 # ms
sleep 100e-6 # us
sleep 10e-9 # ns


Thank you for the trick (and Berny for the documentation patch).
It's new to me, but I guess that's what I get for not investigating
the  info page's notes.


Though it's a nice trick, it obviously depends on the value not
having an E exponent already. When that can be assumed or assured,
it's useful, no doubt.






Re: [PATCH] sleep: allow ms suffix for milliseconds

2019-11-29 Thread Kaz Kylheku (Coreutils)

On 2019-11-29 09:38, Bernhard Voelker wrote:

On 2019-11-29 14:30, Rasmus Villemoes wrote:

When one wants to sleep for some number of milliseconds, one can
do (at least in bash)

  sleep $(printf '%d.%03d' $((x/1000)) $((x%1000)))

but that's a bit cumbersome.


Why not use floating-point numbers directly?

  $ sleep 0.01234


I think the point is that the above example is doing exactly that,
but it has to convert from a value x which is in milliseconds.
The shell has only integer arithmetic, so a clumsy expression is
required.

If the shell had floating arithmetic, it would just be this:

   sleep $((x / 1000))

With GNU dc we can do:

   sleep $(dc -e "3k $x 1000/p")



Calling sleep(1) with a small milliseconds argument seems anyway a very
rough hammer, because the overhead to launch the executable is larger
than the actual nanosleep(2) system call.


Well, nobody says that the x value is in the range [0, 1000).

Sleeping for 15500 milliseconds is valid.

But in any case, we can already do that with "sleep 15.500".

The issue is that it's cumbersome to convert from 15500 to 15.500
in a shell script.

That's the problem to fix.

Next time someone will need another such conversion in another
context, and then yet another; we can't be adding units suffixes
into every utility that takes numeric arguments.

Fix the right problem in the right place.

That goes especially for issues that aren't blockers; there is
no urgency to address this problem with a quick fix like "sleep 123m"
because the cumbersome shell code works fine.
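
For what it's worth, the cumbersome code only has to be written once;
a tiny helper function (the name and details are just a sketch) hides
it:

   # msleep: sleep for the given number of milliseconds
   msleep()
   {
     sleep "$(printf '%d.%03d' $(($1 / 1000)) $(($1 % 1000)))"
   }

   msleep 15500   # sleeps for 15.5 seconds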




Re: [PATCH] sleep: allow ms suffix for milliseconds

2019-11-29 Thread Kaz Kylheku (Coreutils)

On 2019-11-29 05:30, Rasmus Villemoes wrote:

* src/sleep.c: Accept ms suffix.
* doc/coreutils.texi (sleep invocation): Document it.

When one wants to sleep for some number of milliseconds, one can
do (at least in bash)

  sleep $(printf '%d.%03d' $((x/1000)) $((x%1000)))

but that's a bit cumbersome. Extend sleep(1) to also accept "ms" as a
suffix, so one can instead do

  sleep ${x}ms


It could be worth accepting a few other suffixes like "us" and "ns".




Re: question about SI/IEC in df

2019-11-28 Thread Kaz Kylheku (Coreutils)

On 2019-11-28 10:16, Kaz Kylheku (Coreutils) wrote:

But, let me remark, using KB, MB, ... for powers of 1000 is neither
metric, nor grounded in tradition.  If it's all caps like KB and MB,
it's clearly 1024-based just like without the B. There has to be a


Sorry about that, this is flatly wrong; lower case b is bits: kb is a 
kilobit.





Re: question about SI/IEC in df

2019-11-28 Thread Kaz Kylheku (Coreutils)

On 2019-11-28 04:39, Krzysztof Labus wrote:

In the manual I see:
The SIZE argument is an integer and optional unit (example: 10K is
10*1024).  Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,...
(powers of 1000).

1. Why is df not using Ki, Mi, Gi etc. for powers of 1024?


- Wastes space.
- Flouts tradition.
- Scripts in the wild depend on the details of utility output; don't
  mess with it.
- It's ultimately "bike shedding".

But, let me remark, using KB, MB, ... for powers of 1000 is neither
metric, nor grounded in tradition.  If it's all caps like KB and MB,
it's clearly 1024-based just like without the B. There has to be a
lower-case b, and proper casing of the scale: k M g t p.  M is
capitalized because m stands for milli, but the b won't be capitalized,
hence Mb.


References: https://en.wikipedia.org/wiki/Kilobyte

"The internationally recommended unit symbol for the kilobyte is kB."

"In some areas of information technology, particularly in reference to 
digital
memory capacity, kilobyte instead denotes 1024 (210) bytes. This arises 
from the
powers-of-two sizing common to memory circuit design. In this context, 
the symbols

K and KB are often used."








Re: feature request du/find

2019-10-31 Thread Kaz Kylheku (Coreutils)

On 2019-10-30 13:14, Benjamin Arnold wrote:

Hi,

thanks a lot for your quick response.

Sorry, i must have missed the -links option, that's exactly what i am
looking for.


Unfortunately, it's a component in an incorrect solution.
A file tree being backed up can contain hard links (e.g. two
executables in /usr/bin being the same file).

The general condition we must look for is that if the tree has
N directory entries pointing to the same object O, then O is entirely
contained in that tree if its link count is N.
Otherwise the count must be > N, and there are links to it
elsewhere.

A static -links predicate in find will not do it.

In my case du would have counted "twice", because the other hard link
is not in the directory du is searching in.


I think this should be a feature of du; and likely du has most
of the pieces in place to make this easy.

It already identifies multiply linked objects. All it needs is
a flag which will cause it to disregard the size of any
object which has more links than the number of times
du has encountered that object.

The obvious algorithm will have an effect on the order in which du
reports objects. When the option is in effect, du will show the
path which references the *last* occurrence of the object (in the
traversal order). E.g. if some object with link count = 3 is
processed, the first two appearances of it won't be reported and
counted, but when the third one is seen, du can be sure that there
are no other references and can then tally the object's size
and report on it.

This algorithm will naturally cull the objects with excessive link
counts: the condition for reporting them and adding them to the
total simply isn't reached.
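
The underlying condition can be prototyped outside of du with GNU find
and awk (a rough sketch; %i and %n are GNU find's inode number and
link count, and TREE stands for the directory being measured):

   # list inodes under TREE that also have links outside of it: the
   # on-disk link count exceeds the number of appearances in TREE
   find TREE -type f -printf '%i %n\n' |
   awk '{ seen[$1]++; links[$1] = $2 }
        END { for (i in seen) if (links[i] > seen[i]) print i }'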



Re: How to implement the V comparsion used by sort in python?

2019-10-27 Thread Kaz Kylheku (Coreutils)

On 2019-10-26 16:05, Peng Yu wrote:
Are you sure they are 100% compatible with V? I don't want to use them
just to later find they are not 100% compatible.


"are you sure various Python packages are compatible with sommething
vaguely described in a some years-old obscure blog post" doesn't
seem like a great question for the Coreutils mailing list.

Try a Python forum?




Re: Does head util cause SIGPIPE?

2019-10-25 Thread Kaz Kylheku (Coreutils)

On 2019-10-25 00:56, Ray Satiro wrote:

Recently I tracked down a delay in some scripts to this line:

find / -name filename* 2>/dev/null | head -n 1


(Here 'filename*' should be quoted, because we want find
to process the pattern, not for the shell to try to expand it.)

Interestingly, POSIX neglects to say whether head is required to
quit after dumping the required number of lines: that is, whether it
terminates (thereby abruptly closing its standard input, possibly
causing a broken pipe error in the upstream process).

(Of course head can be given two or more file arguments, in
which case it doesn't quit until the last one is processed.)

In a RATIONALE paragraph, though, POSIX says that head, for a
single file, could be simulated using "sed 10q"; and that *will*
quit immediately and break the pipe.
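
That behavior is easy to observe:

   # sed quits after ten lines; yes is then killed by SIGPIPE
   yes | sed 10q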


It appeared that the searching continued after the first result was
found. I expected head would terminate and SIGPIPE would be sent which
would cause find to terminate, but that did not happen.



Since I was in
cygwin I thought maybe it was a Windows issue but I tried in Ubuntu 16


SIGPIPE not working in Cygwin cannot possibly be a "Windows issue",
since Windows has no such thing as SIGPIPE; it would be a Cygwin issue.


with an arbitrary file and noticed the same thing.


I don't see it in Ubuntu 16 or 18 at all. "find / | head" shows ten
lines, and after that, there is no evidence of any find process
continuing to execute. If I run:

   find / | head && ps -aux | grep find

the grep process only finds itself in the output of ps; I tried about
20 times in a row, hoping to catch a briefly lingering find, but
nothing.





Re: Compile Coreutils without xattr but i installed

2019-10-10 Thread Kaz Kylheku (Coreutils)

On 2019-10-10 11:56, Wei MA wrote:

I compiled the source code. And when i ran tests/cp/capability.sh, "cp
preserves attr" failed without xattr support. Then i installed xattr.
I deleted coreutils and downloaded it again. The problem still exists.


A configure problem likely won't be due to a bad copy of coreutils;
you have to debug where that is going wrong: why it isn't detecting
the xattr.

It looks like the Coreutils configure script looks for two headers:
<attr/error_context.h> and <attr/libattr.h>.

It also looks for a function attr_copy_file in libattr.

See: http://git.savannah.gnu.org/cgit/coreutils.git/tree/m4/xattr.m4


I use Ubuntu 18. When i ran cp of Ubuntu, the same commands have no
problem.


A possibility may be to find the Ubuntu 18 build recipe for coreutils
and find out what it's doing differently from you. Does it pass
something to the configure script or force any Autoconf variables
(ac_cv_whatevers)? Does it apply any patches, etc.?




Re: md5sum and recursive traversal of dirs

2019-10-10 Thread Kaz Kylheku (Coreutils)

On 2019-10-10 10:29, Сергей Кузнецов wrote:

Hello, I find it strange that md5sum does not yet support recursive
directory traversal. I moved part of the code from ls and added this
functionality. How about adding this? I also added sha3 and md6
algorithms; they are in "gl/lib/".


If we have any utility whatsoever that operates on files, sometimes
we want to apply it to every file in a tree.

It does not follow that every utility whatsoever that operates on files
should integrate the code for traversing a tree.

We have ways in the shell, and in other programming languages, to
map any operation over a tree of files.

The mapping mechanism maps, the MD5 mechanism calculates MD5 sums;
each has a single responsibility.

One noteworthy tree traversal mechanism appears as an extension in the
GNU Bourne-Again Shell (Bash). In Bash, you can set the "globstar"
option like this:

   shopt -s globstar

When this is enabled, the ** operator becomes active in file globbing
patterns. The ** operator spans across multiple path components. For
instance:

  # calculate the md5sums of all .so files anywhere in /usr/lib
  md5sum /usr/lib/**/*.so

By the way, I wrote two new small programs: xchg (or swap, which name
is better?) and exst (exit status).


Exchanging two files can be implemented as a shell function, which can
be extremely simple if we don't worry about exchanging files in
different filesystem volumes. Here is a sketch:

  swap()
  {
    local tmpname=$(mktemp swap-XXXXXX)
    # ... check arguments here for count and sanity ...
    mv -- "$1" $tmpname
    mv -- "$2" "$1"
    mv -- $tmpname "$2"
  }


The first program simply exchanges files, the number of which can be
more than two.


That should probably be called "rotate", like the rotatef operator in
Common Lisp. The logic becomes something like (untested):

   tmpname=$(mktemp rotate-XXXXXX)

   # if we have at least two arguments:
   if [ $# -gt 1 ] ; then
     mv -- "$1" $tmpname

     # while we have two or more arguments
     while [ $# -gt 1 ] ; do
       mv -- "$2" "$1"
       shift
     done

     # last argument gets $tmpname
     mv -- $tmpname "$1"
   fi

Example: rotate some logs:

   rotate deleteme log.2 log.1 log.o log
   rm deleteme

Undoubtedly elegant and useful; but should it be a C program in GNU 
Coreutils? Hardly.



The second program launches the program indicated at startup and, after
its completion, prints the exit status or the caught signal.


Doable in shell scripting, again. The status of the last command is
available in the $? variable. This can be tested:

   stat=$?

   if [ $(( stat & 0x80 )) != 0 ] ; then
      printf "terminated due to signal %d\n" $((stat & 0x7F))
   else
      printf "exited with status %d\n" $stat
   fi

Bash has "massaged" the value already. That is to say, if the program
terminates normally with an unsuccessful status 19 we don't have to do
any shifting to recover the value from the upper bits of an exit status
word; $? simply holds the value 19.




Re: Can natural sort support be added?

2019-10-08 Thread Kaz Kylheku (Coreutils)

On 2019-10-08 00:47, Peng Yu wrote:

Hi,

Since natural sort is provided in a few languages (as mentioned in the
Wikipedia page). Can it be supported by `sort` besides just
version-sort?

https://en.wikipedia.org/wiki/Natural_sort_order


This page has no precise specific definition of natural sort order.

Its external references have poor credibility, consisting of a blog
entry at "codinghorror.com" and links to some implementations of
something in Perl, PHP, Python and Matlab.

A link to some IETF RFC or ISO standard or other major document is
required.






Re: [PATCH] chown: prevent multiple --from options

2019-09-29 Thread Kaz Kylheku (Coreutils)

On 2019-09-29 02:46, Francois wrote:
We can fix by rejecting the cases where the --from option is provided
multiple times and uid or gid are set twice.


A more sophisticated fix is to allow --from to be given multiple times,
but have the resulting range be the intersection of all of the ranges
given.

Each successive --from applies clipping to the range calculated so far.

If a uid or gid is given twice, but the values match, that should be
fine too; why not.





Re: Wishing rmdir had a prompt

2019-09-02 Thread Kaz Kylheku (Coreutils)

On 2019-09-02 01:03, Sami Kerola wrote:
I am not a maintainer, but I don't see any problem adding --interactive
as a long-only option. Getting a short option may clash with a future
posix requirement, so I believe they are not handed out without really
good reasons.


Fear not; POSIX standardization is not ignorant of significant
implementations like those from GNU.

For example, here is a literal quote from Issue 7 (2018) POSIX's awk
page:

"The undefined behavior resulting from NULs in extended regular
expressions allows future extensions for the GNU gawk program to
process binary data."


https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

"GNU -> GNU Now (mentioned in) Unix"

:)



Re: Wishing rmdir had a prompt

2019-09-01 Thread Kaz Kylheku (Coreutils)
On 2019-09-01 17:50, Leslie S Satenstein via GNU coreutils General 
Discussion wrote:

rmdir -i


I don't see this in a fairly recent GNU Coreutils 8.28 installation. 
Must be very new?


There is some justification for such a thing. Though it may seem that
accidental deletion of empty directories is easy to repair by
recreating them, that does not put the file system back in the
pre-deletion state. The modification timestamps of the parent
directories are tweaked, and the new empty directories come up with
their own modification timestamps, as well as different inode numbers,
permissions, and group and user ownerships. There is no substitute for
not deleting something; what comes close is recovering it from a
backup.




Re: /bin/echo -- $var

2019-08-15 Thread Kaz Kylheku (Coreutils)

On 2019-08-15 00:53, Harald Dunkel wrote:

IMHO they should have kept the "no args allowed" for echo
("in the late 70s") and should have introduced a new tool
"eecho" instead.


Well, new tool for printing was introduced under the name "printf".
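
Unlike echo, printf has no trouble with operand values that look like
options:

   var="-n"
   printf '%s\n' "$var"   # prints: -n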




Re: /bin/echo -- $var

2019-08-14 Thread Kaz Kylheku (Coreutils)

On 2019-08-14 05:01, Harald Dunkel wrote:

Hi folks,

I just learned by accident that

var="-n"
/bin/echo -- $var

actually prints

-- -n

Shouldn't it be just

-n
?


According to POSIX, echo doesn't take options. It is specified
that "Implementations shall not support any options."
(We have options, though,  so things are complicated.)

Furthermore, the specification explicitly speaks of -- thusly:

"The echo utility shall not recognize the "--" argument in the manner
specified by Guideline 10 of XBD Utility Syntax Guidelines; "--"
shall be recognized as a string operand."





Re: How to convert a md5sum back to a timestamp?

2019-08-01 Thread Kaz Kylheku (Coreutils)

On 2019-07-31 20:36, Peng Yu wrote:

Hi,

Suppose that I know a md5sum that is derived one of the timestamps
computed below. Is there a way to quickly derive what the original
timestamp is? I could make a database of all the timestamps and their
md5sums. But as the total number of entries increases, this solution
will not be scalable as the database can be big. Is it there any
better solution to this problem?

for i in {1..2563200}; do date -d "-$i minutes" +%Y%m%d_%I%M%p; done


The solution to this is to back up several levels in whatever you are
working on, and restructure the approach to the real problem in such a
way that the flowchart box which says "And here we just crack MD5
sums" is somehow eliminated.




Re: Possible ls bug?

2019-03-20 Thread Kaz Kylheku (Coreutils)

On 2019-02-26 13:10, Bartells, Paul wrote:

I have encountered behavior with ls -R that appears to be incongruous.
My actual command line entry is: ls -alR /kyc_mis/dev/*/*/paul/* > pauldev.lst.

[ ... ]


/kyc_mis/dev/rpts/paul/kyc:
total 599
-rwxrwx---  1 pb82477 kycmis 262144 Oct 31 17:06 kyc_excepreport_old_20181022.sas7bdat
drwxrwx--- 10 pb82477 kycmis    176 Oct 24 16:14 ..
drwxrwx---  2 pb82477 kycmis     55 Oct 31 17:06 .


What is odd here is that name of the directory being listed doesn't
match the pattern /kyc_mis/dev/*/*/paul/ (assuming we can trust that
line of the output to be true: ls is really listing a directory by
that name).

It has a "kyc" component where the pattern expects "paul".

You haven't told "ls" to follow symlinks, either.

Unless the filesystem has cycles, or ls has started following ".."
links, we should not be ending up in this directory from any starting
point that matches the command line pattern.






Re: [PATCH] df: Adding a --no-headers option, by request of Bruce Dubbs

2019-03-20 Thread Kaz Kylheku (Coreutils)

On 2019-03-17 05:27, Ed Neville wrote:

Taking suggestions into account, '--no-headers' seems more consistent
with ps options.


This loses on character count:

  df --no-headers
  df | sed 1d

Fully golfed:

  df|sed 1d

Oops!




Re: FAQ confusing terminology regarding GNU and Linux Relationship

2018-10-17 Thread Kaz Kylheku (Coreutils)
On 2018-10-16 18:58, fdvwc4+ekdk64wrie5d8rnqd9...@guerrillamail.com 
wrote:

Under the section in the FAQ about uname, it refers to ``the Linux
kernel." Is not the GNU position that Linux should be referred to as
``Linux, the kernel' or something similar?


The GNU position is that an operating system distribution consisting of
GNU programs and libraries on top of Linux shouldn't just be called
"Linux" because that's just the name of the kernel.

The appositive expression "Linux, the kernel" is misleading, in fact,
because it insinuates that Linux needs to be qualified in this manner,
because there is some relevant Linux that isn't a kernel.

Linux, the kernel---as opposed to Linux, the what?

It only makes sense in a sentence like "we're talking about Linux,
the kernel, not Linux, the Swiss laundry detergent".




Re: RFC: rm --preserve-root=all to protect mount points

2018-06-11 Thread Kaz Kylheku (Coreutils)

On 2018-06-10 23:14, Pádraig Brady wrote:

I was asked off list to consider adding an option to rm
that could be enabled with an alias, and would protect
mount points specified on the command line.

[...]

  $ rm -r --preserve-root=all /dev/shm
  rm: skipping '/dev/shm', since it's a mount point
  rm: and --preserve-root=all is in effect


The command option is well-named, but consider changing
"mount point" to "mount" in this diagnostic and, more
importantly, any documentation which refers to this.

E.g. "since a filesystem is mounted there",
"since it is a filesystem root", etc.

I think the "mount point" terminology is misleading because
one important sense of the word is that it refers to the
Unix kludge of requiring an empty directory to exist for
a mount. The empty directory where one intends to mount a
filesystem is the "mount point" for it.

This option cannot protect directories which are mount
points in that sense; only ones that are carrying mounts.




Re: mv --recursive

2018-06-01 Thread Kaz Kylheku (Coreutils)

On 2018-06-01 04:08, Grady Martin wrote:

Hello.  I have two questions:

· Is there a way to recursively merge two directories with move (not
copy/delete) operations, using standard GNU utilities?
· If not, how do coreutils' maintainers feel about an
-r/-R/--recursive patch for mv?


We can almost do this already, with cp, except that the files also
remain at the source.



■ mv -R old new
■ ls -R


I.e.

cp -rl old/. new/.

The new/ tree is populated with hard links to corresponding objects in 
old, which is what mv will do (on the same filesystem, anyway).


Basically, if cp had an option called "--remove-source", which does what 
its name says, I think it would do what you want.


cp itself could optimize using hard linking when that option is 
specified, and the source and destination directories are on the same 
filesystem, which supports linking.


cp with --remove-source would just about obsolesce mv.




Re: performance bug of `wc -m`

2018-05-24 Thread Kaz Kylheku (Coreutils)

On 2018-05-20 16:43, Bruno Haible wrote:

Kaz Kylheku wrote in
https://lists.gnu.org/archive/html/coreutils/2018-05/msg00036.html :

In what situation are there printable characters in the range
[0, UCHAR_MAX) that have a width > 1?


That's the wrong question. The question is which characters in this
range have width > 1 or <= 0.

The program below shows that the answer (on a glibc system) is:
The character 0x00AD (= SOFT HYPHEN) is printable but has width == 0.


I tried printing this on several terminals; all actually render
something that is one character position wide.

A program which calculates column positions on a terminal will be
wrong if 0xAD has been printed, and it relies on this bogus datum from
glibc.




Re: performance bug of `wc -m`

2018-05-17 Thread Kaz Kylheku (Coreutils)

On 2018-05-13 15:05, Philip Rowlands wrote:

In the slow case, wc is spending most of its time in iswprint /
wcwidth / iswspace. Perhaps wc could learn a faster method of counting
utf-8 (https://stackoverflow.com/a/7298149); this may be worthwhile as
the trend to utf-8 everywhere marches on.

I can't explain without more digging why Python's string
decode('utf-8') is better optimised for length calculations.


On the surface, it seems easy to explain: the Python program is
just decoding UTF-8 and then taking the length. None of that
requires character classification and determination of display width.

If "wc -m" is doing something with display with, it's very different
from what the Python is doing.

What are the requirements underpinning "wc -m", and how do these
iswprint and iswspace functions fit into it?

POSIX says this: "The -c option stands for "character" count,
even though it counts bytes. This stems from the sometimes erroneous
historical view that bytes and characters are the same size.
Due to international requirements, the -m option (reminiscent of
"multi-byte") was added to obtain actual character counts."

I don't see how this amounts to having to call iswspace and all that.

Nowhere does POSIX say that the display width of a character
has to be obtained in "wc" and I don't see that in the GNU documentation
either.
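
(For reference, the Python measurement being compared against amounts
to little more than this one-liner, modulo how the benchmark read its
input; the file name is illustrative:)

   python3 -c 'import sys; print(len(sys.stdin.buffer.read().decode("utf-8")))' < file.txt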




Re: performance bug of `wc -m`

2018-05-17 Thread Kaz Kylheku (Coreutils)

On 2018-05-16 17:13, Eric Fischer wrote:

I also found wcwidth to be a bad performance bottleneck in my multibyte
branch of coreutils. To fix the problem in my branch, I added a cache
of the widths returned for characters in the range from 0 to UCHAR_MAX
(which perhaps should also be widened to include a few other common
alphabets).

The caching code is at the bottom of


In what situation are there printable characters in the range
[0, UCHAR_MAX) that have a width > 1?

The lowest-numbered Unicode character that requires two spaces is
U+1100, I think.





Re: is there a real escape "quoting" style for ls?

2018-05-15 Thread Kaz Kylheku (Coreutils)

On 2018-05-13 09:30, Harald Dunkel wrote:

On 5/13/18 1:08 PM, L A Walsh wrote:


If you look under --quoting-style, you'll
see:
 --quoting-style=WORD   use quoting style WORD for entry names:
    literal, locale, shell, shell-always,
    shell-escape, shell-escape-always, c, escape

I haven't verified, but it looks like one of the options
with the word 'shell' in it might be more in line w/what you
want...



Maybe you should.

c                    "A Knight's Tale: Part 2"
escape               A\ Knight's\ Tale:\ Part\ 2
literal              A Knight's Tale: Part 2
locale               'A Knight\'s Tale: Part 2'
shell                "A Knight's Tale: Part 2"
shell-always         "A Knight's Tale: Part 2"
shell-escape         "A Knight's Tale: Part 2"
shell-escape-always  "A Knight's Tale: Part 2"

bash command line completion gives me one of

A\ Knight\'s\ Tale\:\ Part\ 2
"A Knight's Tale: Part 2"
'A Knight'\''s Tale: Part 2'


The colon character doesn't require escaping for the purposes of
command line processing; the character has no special meaning in the
shell syntax. (Of course there is a : command, but that's not via
special treatment of the character.)

Bash's completion, however, assumes that unescaped colons are separators
of PATH-like lists.

If you have a file called foo:bar and you type echo foo:b[Tab] it will
not complete on it; it treats foo:bar as a PATH-like list of two
independent items, and tries to complete on just the "b". You will have
to type foo\:[Tab] to get the foo\:bar completion, or "foo:[Tab]

But that escape is not actually necessary for the processing
of the command line. It makes no difference: the word foo\:bar
produces the same argument as foo:bar.
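
(If the colon splitting gets in the way often, Bash lets you remove
the colon from its completion separators; a sketch:)

   # drop ':' from the characters Bash completion treats as word breaks
   COMP_WORDBREAKS=${COMP_WORDBREAKS//:/}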

Ever the burning question: what are you trying to do? How are you
blocked from doing that by colons not being escaped in the output of ls?

Are you trying to copy and paste a *partial* escaped filename from
the output of ls and then Tab-completing on it?

In that case, sure, this style will not do:

  $ A\ Knight's\ Tale:\ [Tab]

But this style will work:

  $ "A Knights' Tale: [Tab]






Re: Difference in binaries present in old and new versions of gnu tools

2018-05-03 Thread Kaz Kylheku (Coreutils)

On 2018-05-02 23:23, Mathai, Eldho (Nokia - IN/Bangalore) wrote:

After the make install we could see many binaries are missing in the
latest when compared with our existing old version. Can you help me
here to know why these binaries are missing and where can I get the
latest versions of these missing binaries.


Do you honestly think that "mysql_upgrade_shell" and "omniidlrun.py"
are GNU core utilities???





Re: env: add -S option (split string for shebang lines in scripts)

2018-04-25 Thread Kaz Kylheku (Coreutils)

On 2018-04-24 22:09, Pádraig Brady wrote:
I was thinking that the explanation of -S in usage() would say
something like:

  -S, --split-string=S  process and split S into separate arguments;
                        used to pass multiple arguments on shebang lines


One little problem with this FreeBSD design is that in spite of
supporting variable substitution, it lacks the flexibility of
specifying where among those arguments the script name is inserted!

Say we have a "#!/usr/bin/env -S lang ..." script called "foo.lang".

Suppose we want it so that "lang -f foo.lang -x" is invoked, where
-f is the option to the lang interpreter telling it to read the
script file foo.lang, and -x is an option to the foo.lang script
itself.

It would be useful if the -S mechanism could indicate this insertion
of foo.lang into the middle of the arguments.

The variable expansion mechanism could do it. Suppose that a certain
reserved variable name like ${ENV_COMMAND} (not my actual suggestion)
expands to the name of the last argument received by env.

Furthermore, when env is asked to expand this variable one or more
times, it makes a note of this and then sets a flag which suppresses
the script name from appearing in its usual position at the end.

Then this is possible:

#!/usr/bin/env -S lang -f ${ENV_COMMAND} -x





Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr

2018-03-27 Thread Kaz Kylheku (Coreutils)

On 2018-03-20 15:18, Assaf Gordon wrote:

Two things for later (not critical for now):
to make review easier, it's recommended to combine all commits that
relate to a single program into one commit. This is called "squash"
in git (see:
http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html


All little commits that achieve one logical change should be squashed.
For instance, don't have

   239d4f9 foo: implement feature X.
   3df77ab foo: fix missing semicolon in new X.
   93df301 foo: fix null pointer deref due to X.

These little incremental fixes in the development of feature X shouldn't
be published as separate changes; only the polished, debugged, reviewed
"feature X".


However, changes pertaining to different development "topics" should
never be squashed into one commit.

Two commits to the same program in GNU Coreutils are not automatically
on the same topic.

E.g. these would be wrong to squash together:

   100df03 ls: implement quoting for whitespace.
   69d34d0 ls: fix bad indentation in several functions.

Review is certainly not easier when multiple changes are combined.
I don't want to review some change in logic, under the distraction of
numerous whitespace changes, or changes in unrelated logic.


https://blog.carbonfive.com/2017/08/28/always-squash-and-rebase-your-git-commits/
).


That is just lunacy. Certainly, you should cleanly cherry-pick or
rebase all changes onto a single mainline without a crazy merge graph:
that much of it is true.

Squashing all changes is poor.

"Patch bombs" (big changes that combine multiple topics in one diff) 
will

not pass review in any shop that understands version control.




Re: ls is broken, what's next cd?

2018-02-06 Thread Kaz Kylheku (Coreutils)

On 2018-02-06 01:30, Michael wrote:

On 06/02/2018 08:13, Bernhard Voelker wrote:

On 02/06/2018 12:41 AM, Michael Felt wrote:
imho, the main problem is you change the default behavior, and 43
years of programs are broken.


no, as explained it does not affect programs and scripts, because this
only changes the output to the terminal.


Yes, I thought about that too. So, maybe I would have liked the choice
to be able to have them quoted IF and/or WHEN I needed to cut/paste
names. But now I must either not install coreutils (as I have that
option) or always remember to add three characters (' -N') everytime I
want normal ls output.


Are you saying that even names without spaces are being quoted?

If you only see these quotes for idiotic file names, then there is
really no issue.

Nobody should even listen to your complaint, because it is
prompted by the fact that you have such names in your filesystem,
which automatically makes you wrong in the face of the same 43 years
of Unix alluded to upthread.

I'm ideologically opposed to this -N thing myself, or anything
which caters to these ill-conceived file names.

However, practically speaking, sometimes professionals who
do not themselves name things that way can fall "victim" to
people who do. If you have to deal with someone else's filesystem
or tarball or whatever, it does behoove you if your ls disambiguates
things for you.



Re: ls is broken, what's next cd?

2018-02-05 Thread Kaz Kylheku (Coreutils)

On 2018-02-05 07:18, Андрей Кинзлер wrote:

Hi, 

After upgrading my distro, I was quite surprised when I saw some of my
files wrapped with single quotes. Why adding such a useless feature of
an unnecessary garbage to the terminal output? Were you inspired by
Microsoft when fixing something that wasn't broken and now the files
are totally misaligned when you type 'ls'? 


+1

There is no need for this pointless garbage. Unix has gotten along
without quoting the output of ls for 43 years now.

Programs which parse the output of "ls" are broken.

They are also anachronistic; fixing the output of "ls" amounts
to a solution that someone needed in 1987, not in 2018.

Any halfway decent scripting language nowadays gives you some access
to readdir; if not via its library, then via FFI. Plus utils for
globbing, walking the file system and so on.

Never mind halfway decent scripting language; even in POSIX shell
scripts, there is no need to read the output of ls.




Re: Why is `find -name '*.txt'` much slower than '*.txt' on glusterfs?

2018-01-23 Thread Kaz Kylheku (Coreutils)

On 2018-01-19 20:26, Peng Yu wrote:

Hi,

There are ~7000 .txt files in a directory on glusterfs. Here are the
run times of the following two commands. Does anybody know why the find
command is much slower than *.txt? Is there a way to change the api
that `find` uses to search files so that it can be more friendly to
glusterfs?


A wild guess: find is calling stat on every directory entry that it 
reads?


What do you see if you run these commands under "strace"?
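
For example, strace's summary mode gives a quick syscall census (the
path is illustrative, and the exact stat-family syscall names vary by
platform):

   # count syscalls; a stat call per directory entry stands out
   strace -c -f find /path/to/dir -name '*.txt' > /dev/null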

On GNU/Linux, programs that search through directories can avoid calling
stat in many cases by taking advantage of the "d_type" field in
"struct dirent". Maybe this doesn't work on glusterfs?

The *.txt syntax (that specific case of it) doesn't have to stat any
inodes because it just cares about the names, regardless of whether
they are directories or other objects.








Re: Would a patch adding color to cat(1) be accepted?

2017-10-10 Thread Kaz Kylheku (Coreutils)

On 10.10.2017 07:03, Leslie S Satenstein wrote:

My RESPONSE
KISS


Hey, why not? Next year, everyone's embedded system will have twice
the flash.

Then they can stop using BusyBox and switch to Coreutils with
colorized everything!





Re: cp, ln, mv, install: check for vulnerable target directories

2017-09-21 Thread Kaz Kylheku (Coreutils)

On 21.09.2017 11:03, Kaz Kylheku (Coreutils) wrote:

On 21.09.2017 09:18, Kaz Kylheku (Coreutils) wrote:

On 20.09.2017 18:59, Paul Eggert wrote:

Kaz Kylheku (Coreutils) wrote:


Instead of checking for what *could* go wrong, why not defend more
specifically against signs that the attack might be actually 
happening.


That's what the patch is trying to do, though it looks like it should
be improved.


There is a simple operating system fix for this: do not allow processes
to create symlinks in directories to which they only have write access
via S_IWOTH.


Two additional notes:

Rather than a hard-coded behavior, this could be a "nolink" mount
option, somewhat analogous to "nodev" (deny use of device nodes present
in the filesystem).


I completely missed the full value latent in this analogy.

Just like "nodev" doesn't prevent creation of device nodes with mknod,
but is aimed at curtailing their *use*, the proposed "nolink" mount 
option

could similarly prevent the *traversal* of symlinks created in a shared
directory rather than blocking the means of their creation.

Suppose user mallory creates a symlink in a directory where multiple
non-root users have write access. Then have it so that only mallory
can follow the symlink (being the owner of the link).

A symlink owned by mallory in a directory that is writable to alice
shall not be dereferenceable by alice; there will be an EPERM when alice
tries to traverse it.

(If the filesystem is mounted "nolink".)


The permission denial would have to apply, of course, not only when a
new symlink is created via the symlink system call, but also to:


That takes care of the problem of trying to police various ways
in which a symlink can sneak in. Let the symlink appear by any means,
and just render it inoperable.

The criterion used by "nolink" could be that if a directory is being
traversed and the next path component is a symbolic link found in
that directory, then the traversal is allowed if either:

* the owner of the link and owner of the directory are the same UID; or

* the owner of the link is the same as the effective UID of the caller.

This is applied regardless of the permissions on the directory.
Otherwise the traversal is denied with EPERM. All other existing
considerations for the use of the symlink continue to apply also.

Essentially, a directory in a "nolink" mounted FS has to "vouch for"
a child symlink by having the same security attribute with regard to
ownership, or else the symlink has to be the caller's own item.
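
The rule is simple enough to state as a user-space check; here is a
sketch (GNU stat without -L examines the link itself, not its target):

   #!/bin/sh
   # would the proposed "nolink" rule let the calling user follow $1?
   link=$1
   dir=$(dirname -- "$link")
   link_uid=$(stat -c %u -- "$link")   # owner of the symlink itself
   dir_uid=$(stat -c %u -- "$dir")     # owner of the containing directory
   if [ "$link_uid" = "$dir_uid" ] || [ "$link_uid" = "$(id -u)" ]; then
     echo "traversal allowed"
   else
     echo "traversal denied (EPERM under nolink)"
   fi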





Re: cp, ln, mv, install: check for vulnerable target directories

2017-09-21 Thread Kaz Kylheku (Coreutils)

On 21.09.2017 09:18, Kaz Kylheku (Coreutils) wrote:

On 20.09.2017 18:59, Paul Eggert wrote:

Kaz Kylheku (Coreutils) wrote:


Instead of checking for what *could* go wrong, why not defend more
specifically against signs that the attack might be actually 
happening.


That's what the patch is trying to do, though it looks like it should
be improved.


There is a simple operating system fix for this: do not allow processes
to create symlinks in directories to which they only have write access
via S_IWOTH.


Two additional notes:

Rather than a hard-coded behavior, this could be a "nolink" mount
option, somewhat analogous to "nodev" (deny use of device nodes present
in the filesystem).

The permission denial would have to apply, of course, not only when a
new symlink is created via the symlink system call, but also to:

* an attempt to move an existing symlink into a directory where the
  caller has write permission only via S_IWOTH. (The rename system call
  has to check and enforce this).

* an attempt to duplicate a symlink into a directory via hard linking.
  (The link system call has to check and enforce).

* any other situation: overlaid directories? (In consideration of
  whether a malicious symlink could be perpetrated in situations in
  which a shared directory is formed by overlaying via unionfs,
  overlayfs and their ilk, and the attacker is able to create symlinks
  in some of the underlying directories even though such an attempt is
  blocked in the assembled directory.)








Re: cp, ln, mv, install: check for vulnerable target directories

2017-09-21 Thread Kaz Kylheku (Coreutils)

On 20.09.2017 18:59, Paul Eggert wrote:

Kaz Kylheku (Coreutils) wrote:


Instead of checking for what *could* go wrong, why not defend more
specifically against signs that the attack might be actually 
happening.


That's what the patch is trying to do, though it looks like it should
be improved.


There is a simple operating system fix for this: do not allow processes
to create symlinks in directories to which they only have write access
via S_IWOTH.

More precisely, the proposal is that if a process wants to create a
symlink, then it either has to be root, or else the owner of the
directory with S_IWUSR asserted on the directory, or else the group
owner (directly or via a supplementary GID) with S_IWGRP asserted.
For the purposes of creating a symlink, the directory is treated as
if S_IWOTH were false, even if set.

The main use cases for shared writable directories are /tmp and
"spool" directories.

I can't think of a legit reason to be creating symlinks in those
directories, only subdirectories (in which the creator then makes
symlinks), regular files, and some special objects like sockets.

A symlink in a shared writable directory is nothing more than a "name
squatting" trap. Ergo, don't allow that.  Or else, the responsibility
for defense spreads all over the system, such as into basic utilities!





Re: cp, ln, mv, install: check for vulnerable target directories

2017-09-20 Thread Kaz Kylheku (Coreutils)

On 19.09.2017 00:25, Paul Eggert wrote:

For years cp and friends have been subject to a symlink attack, in
that seemingly-ordinary commands like 'cp a b' can overwrite arbitrary
directories that the user has access to, if b's parent directory is
world-writable and is not sticky and is manipulated by a malicious
user.


Also, it occurs to me that the attack can be perpetrated if any of the
ancestral directories are writable to another non-root user.

Suppose we have

   cp passwd /alpha/beta/gamma/delta/omega

If the attacker can write to /alpha, the attacker can create a symlink
in a path like this:

   /home/attacker/beta/gamma/delta/omega -> 

and, having write access to /alpha, the attacker can replace the
/alpha/beta directory with this symlink:

   /alpha/beta -> /home/attacker/beta





Re: cp, ln, mv, install: check for vulnerable target directories

2017-09-20 Thread Kaz Kylheku (Coreutils)

On 19.09.2017 00:25, Paul Eggert wrote:

For years cp and friends have been subject to a symlink attack, in
that seemingly-ordinary commands like 'cp a b' can overwrite arbitrary
directories that the user has access to, if b's parent directory is
world-writable and is not sticky and is manipulated by a malicious
user.


From patch:

PE> +environment variable.)  For example, if @file{/tmp/risky/d} is a
PE> +directory whose parent @file{/tmp/risky} is world-writable and is
PE> +not sticky, the command @samp{cp passwd /tmp/risky/d} fails with
PE> +a diagnostic reporting a vulnerable target directory, as an attacker
PE> +could replace @file{/tmp/risky/d} by a symbolic link to a victim
PE> +directory while @command{cp} is running.  In this example, you can
PE> +suppress the heuristic by issuing one of the following shell commands
PE> +instead:

Instead of checking for what *could* go wrong, why not defend more
specifically against signs that the attack might be actually happening.

Somehow detect, "Uh oh! Parent is writable by another non-root user, and
the last component opened through a symlink!" while carefully guarding
against race conditions that could render such a defense tactic less 
than

fully effective.




RE: How to submit my utility for inclusion in coreutils?

2017-09-14 Thread Kaz Kylheku (Coreutils)

On 02.09.2017 07:29, Quiroz, Hector wrote:

Thanks for your reply.
I will write it in c then I will ask again.
Thanks


You would be doing something pretty silly: taking a working
Perl script which performs adequately and rewriting
it in C just so that it can potentially be included in some
project that requires everything to be written in C.

It's coding to solve a political/ideological issue, not a
technical one.




Re: New utility suggestion: chdir(1)

2017-08-27 Thread Kaz Kylheku (Coreutils)

On 26.08.2017 11:10, Colin Watson wrote:

I would like there to be an adverbial version of "cd", which takes a
path followed by a command and optional arguments and executes the
command with its working directory set to the given path.  Its
invocation would be similar to chroot(8), that is:

  chdir [OPTION] NEWDIR [COMMAND [ARG]...]


Could be an option in "env".

  env -C /path/to/dir VAR=value ... command arg

(-C follows "tar -C' and "make -C").




Re: New utility suggestion: chdir(1)

2017-08-26 Thread Kaz Kylheku (Coreutils)

On 26.08.2017 11:10, Colin Watson wrote:

  sudo chroot /path/to/chroot sh -c 'cd /foo && ls -l'


The -c option is not the only way to pass a script to the shell.

You can also pipe it in.

This means dealing with shell quoting, which is tedious and 
error-prone.


   sh <<'end'
   echo 'hello, world'
   end




Re: sort -V behaviour

2017-07-31 Thread Kaz Kylheku (Coreutils)

On 31.07.2017 09:23, Sven C. Dack wrote:

Hello,

I have a question about the -V option to sort, but first some examples:

$ echo -e "1\n1.2\n1.2.3\n1.2.3.4"|sort -V
1
1.2
1.2.3
1.2.3.4

$ echo -e "f1\nf1.2\nf1.2.3\nf1.2.3.4"|sort -V
f1
f1.2
f1.2.3
f1.2.3.4

$ echo -e "/1\n/1.2\n/1.2.3\n/1.2.3.4"|sort -V
/1
/1.2
/1.2.3
/1.2.3.4

$ echo -e "1f\n1.2f\n1.2.3f\n1.2.3.4f"|sort -V
1f
1.2f
1.2.3f
1.2.3.4f


Note that this also has a problem, though the behavior is what you
expect, so you don't notice.

Here, only the last three lines of input contain version numbers.

In each one, the last dot and everything after it is considered
a suffix; the versions being sorted are "1", "1.2" and "1.2.3".



$ echo -e "1/\n1.2/\n1.2.3/\n1.2.3.4/"|sort -V
1.2.3.4/
1.2.3/
1.2/
1/

My question is, why does the -V option reverse the order in the last 
case?



From the info documentation:

 Version-sorted strings are compared such that if VER1 and VER2 are
  version numbers and PREFIX and SUFFIX (SUFFIX matching the regular
  expression `(\.[A-Za-z~][A-Za-z0-9~]*)*') are strings then VER1 < VER2
  implies that the name composed of "PREFIX VER1 SUFFIX" sorts before
  "PREFIX VER2 SUFFIX".

Looks like the SUFFIX regex doesn't match, so these names are not
treated as version names. It doesn't match because of the trailing
slash.

If the trailing slash were included in the suffix match, there would
still be the problem that ".4/", ".3/" and ".2/" are the suffixes,
and the version numbers are "1.2.3", "1.2", and "1", with the last
"1" being a non-version-number input.

Also this is noted:

 This functionality is implemented using gnulib's `filevercmp'
  function, which has some caveats worth noting.
  [...]
 * Some suffixes will not be matched by the regular expression
   mentioned above.  Consequently these examples may not sort as you
   expect:

abc-1.2.3.4.7z
abc-1.2.3.7z

abc-1.2.3.4.x86_64.rpm
abc-1.2.3.x86_64.rpm

Oops! And as you can see from these examples it is tricky.
Sometimes suffixes contain numeric stuff, which is why it's
specified that way.

Here the .7z files don't match the requirement for treatment as
version numbers, because the suffix, following the required period,
must begin with a letter or tilde.


This behaviour is unintuitive and seems wrong to me.


I agree that the specification is not ideal, but it's not easy to
see how it can be improved given the threat of numeric junk like 7z
which cannot be treated as part of the version.

Consider that 1.7z looks like a bigger version than 1.2.7z,
if the 7 is wrongly treated as part of the version!!!

The designers who specified the filevercmp function were clearly
sober to these cases.



Re: How the ls command interpret this?

2017-07-30 Thread Kaz Kylheku (Coreutils)

On 30.07.2017 13:46, Reuti wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Is that necessary?

What's the use of reducing your plausible deniability of rubbish 
postings?



Am 03.07.2017 um 20:30 schrieb BaBa:


Le 2017-07-03 19:17, Eric Blake a écrit :

Case 3:
$ mkdir foo
$ cd foo
$ touch a[b]# the glob doesn't match, so it is passed unchanged
$ ls ?[b]   # the glob doesn't match, so it is passed unchanged
ls: cannot access '?[b]': No such file or directory
$ cd ..
$ rm -rf foo


Yes understood, the glob didn't match.


And if globbing fails, it won't try it without.


Try what without what?

If globbing fails, then what happens is that the
unexpanded glob pattern remains. (That's the POSIX shell behavior;
GNU Bash has a "nullglob" flag which causes non-matching globs to
expand to nothing, and the GNU C Library's glob() function has
a similar flag.)


I mean, it could use
?'[b]' and try again which would succeed.


Why would globbing re-try with random permutations of the pattern
syntax?

(Why not also '?'[b]? Why not ?'['b]?)

Such a ridiculous complication of its specification of behavior
wouldn't help anyone; only lay traps for the unwary script writer.




Re: coreutils feature requests?

2017-07-19 Thread Kaz Kylheku (Coreutils)

On 19.07.2017 10:30, Eric Blake wrote:

On 07/19/2017 12:03 PM, Lance E Sloan wrote:

Hi, Eric.

Thank you for the thoughtful response.  I regret that I have trouble
understanding your point of view, though.  Please know that I do not
mean any disrespect.  I'd appreciate it if you could explain why you're
opposed to adding new features to cut (or to comm).


I'm not opposed to well-justified new features.  It's just that the bar
for justifying new features is rather high (it's a lot of code to add to
get field reordering, and that code has to be tested; it is also a
question of how many users will rely on that extension.  A testsuite
addition is mandatory as part of adding the new feature, if the new
feature is worth adding).


It is nontrivial code. For instance if we look at how the function
cut_bytes works in the implementation, what it's doing is simply
doing a getchar() from the stream, and querying a data structure
to determine whether the byte should be printed or not.
(That data structure consists of a pointer which marches through
field range descriptors in parallel with going through the data.)

cut_fields is more complicated due to the delimiting of fields,
but essentially the same overall approach.

Basically, printing of fields that isn't sorted and de-duplicated
is a rewrite of all parts of the utility other than command
line processing and the printing of usage help text.

It's like two different programs in one, sharing a minimal
skeleton.
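
To make the contrast concrete: GNU cut today writes selected fields in
input order regardless of how the list is given, and the usual escape
hatch for reordering is awk:

   $ echo 1,2,3,4 | cut -d, -f3,1
   1,3
   $ echo 1,2,3,4 | awk -F, -v OFS=, '{ print $3, $1 }'
   3,1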




Re: coreutils feature requests?

2017-07-19 Thread Kaz Kylheku (Coreutils)

On 19.07.2017 10:03, Lance E Sloan wrote:
With regard to your objection to a special environment variable: I agree.

I didn't feel strongly about it at first, but I was leaning towards not
implementing env. var. support for this.  It just didn't feel right.  I
have written programs that use env. var. to specify or override default
options.  However, your point about adding this to an existing, established
program where it could possibly cause variable conflicts is enough to set
my opinion a little more strongly against an env. var. for this purpose.


My own objection to such influencing env-vars is rooted in
"global variables are bad"; i.e. the inherited wisdom from decades
of computer science and software engineering practice.

Say you want to combine code from two or more scripts, all of which
want to configure the same tools in different ways via globals. Now
you have a mess of saving and restoring the global so that it
has the value that each bit of code expects when that code is running.

With magic globals we can't look at a line of code and know exactly
what it will do, without reasoning about what are the current values
of the globals. That reasoning may be intractable: i.e. situations in
which it simply cannot be known. Then the code has to ensure that the
variables have certain values by assigning them. Except, it can't
just assign because that will influence some other code; it has to
save the prior values, then assign, then restore when done.
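
In shell terms, that dance looks something like this (MYTOOL_OPTS and
run_block are hypothetical names, for illustration only):

   saved=${MYTOOL_OPTS-}          # save the prior value, if any
   was_set=${MYTOOL_OPTS+x}       # remember whether it was set at all
   export MYTOOL_OPTS="--fast"    # the value this block of code expects
   run_block                      # code that reads the magic global
   if [ -n "$was_set" ]; then
     export MYTOOL_OPTS="$saved"  # restore the caller's value
   else
     unset MYTOOL_OPTS
   fi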

Of course environment variables can be scoped:

   OPT=value command

but that's local syntax now, which defeats the purpose of the
"action at a distance" effect of the variable; it might as well be
transliterated to:

   command --opt=value

and there we are.






RE: coreutils feature requests?

2017-07-19 Thread Kaz Kylheku (Coreutils)

On 19.07.2017 06:29, Nellis, Kenneth wrote:

From: Steeve McCauley
I can't believe I'd never thought of reordering output columns like 
this. 
FWIW, I agree that another option should be used to prevent issues 
with backward compatibility.


$ echo 1,2,3,4,5,6 | cut -d, -f3,5,2
2,3,5

$ echo 1,2,3,4,5,6 | cut -d, -f3,5,2 -o
3,5,2
Should this be extended to character output as well?
echo output | cut -c6,4,2 -o
tpu



Absolutely! It would be expected behavior (IMHO). I see no reason not 
to.


POSIX expends considerable text in requiring that the fields constitute
a set, and are de-duplicated and put into order. (Quoted in earlier
message.) So indeed, the behavior cannot just be changed to match the
QNX "cut", not just because of backward compatibility (always the 
primary

concern) but standard conformance also (close second).

QNX has a conformance bug here. I wouldn't continue to rely on it.

In addition, so that scripts can work across platforms, I (strongly) recommend
that a cut-specific environment variable be defined to allow specifying the
field ordering behavior. In that way my QNX 4 script (whose cut would balk at
the -o option) would work with Gnu. One possibility:


On the other hand, I hope I'm not alone in being opposed to introducing
new global variables which alter language or tool semantics.

Scripts can target multiple implementations of a command by defining a
wrapping function rather than a magic global variable.
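
A sketch of that wrapper idea; "fields" and the comma delimiter are
illustrative assumptions, not an existing interface. Scripts call the
function, and only its one definition would vary per platform:

   fields() {   # usage: fields 3 1   prints field 3, then field 1
     awk -F, -v OFS=, -v sel="$*" '{
       n = split(sel, idx, " ")
       out = $(idx[1])
       for (i = 2; i <= n; i++)
         out = out OFS $(idx[i])
       print out
     }'
   }

   $ echo a,b,c,d | fields 3 1
   c,a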

A portable script cannot rely on cut having order-preserving fields 
anyway.


Everyone in POSIX-land implementing different features and then using
a zoo of environment variables to emulate each other's features and
quirks would be unmanageable.




Re: coreutils feature requests?

2017-07-18 Thread Kaz Kylheku (Coreutils)

On 18.07.2017 15:44, Lance E Sloan wrote:

Hi, all.

Aside from a bug report, is there a way to submit a feature request for
coreutils?  I have a couple of requests in mind, which I don't think
qualify as bugs:

1. Add a feature to "cut" that allows selected fields to be output in a
different order.  That is "cut -f4,1,8-6" would cause it to output fields
in the order of 4, 1, 8, 7, and 6.


I'm amazed that it doesn't work this way; the utility of implicitly
sorting the fields appears low compared to the damage that it does
to the flexibility of cut. (What little it has!)

If POSIX specifies it, I have to say that its requirements are
suboptimal (as is often the case in diverse areas).

Indeed, the requirement is sadly given as:

   The elements in list can be repeated, can overlap, and can be specified
   in any order, but the bytes, characters, or fields selected shall be
   written in the order of the input data. If an element appears in the
   selection list more than once, it shall be written exactly once.

Do people often write cut specifications in ad hoc orders, with
repetitions and then rely on the sorting behavior?
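
For example, with a conforming cut, repetition, overlap and ordering in
the list all collapse into a sorted set:

   $ echo a,b,c,d | cut -d, -f3,1,3-4
   a,c,d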

The GNU implementation of cut could lead standardization in this
area to improve things.

All the possible solutions to make cut not sort the fields or
de-duplicate them are ugly. Either you need a global option to
opt out of that behavior, like "-o" (preserve (o)rder) and
keep remembering to use it, or else provide order-preserving
versions of the various options, like perhaps through capitalization:
-F, -C, -B.

An attractive alternative is to have a whole new command which
mirrors cut, like say "clip", which is exactly like cut, but
order and repetition preserving.

Note that the --complement option is semantically incompatible with
order-preserving mode; the complement concept follows from the
selected elements being regarded as a set rather than ordered
sequence. If -F, -C or -B is used with --complement, it has to be
diagnosed.  Or if there is a "clip" command, then that simply
doesn't support --complement.



Re: Determination of file lists from selected folders without returning directory names

2017-07-18 Thread Kaz Kylheku (Coreutils)

On 18.07.2017 01:17, SF Markus Elfring wrote:
I imagine that there are more advanced possibilities to improve the software
run time characteristics for this use case.


Well, if it has to be fast, perhaps don't write the code in the shell 
language.


To which “shell” would you like to refer to?


The "Shell Command Language", called by that name in POSIX, and to its
dialects.

Even an interpreted scripting language that can do string handling without
resorting to fork()-based command substitution will beat the shell at many tasks.


How do you think about additional approaches to reduce the forking of
special processes?


I think: don't do text processing whose speed matters in a language where
you have to even think about the issue "how do I reduce fork() occurrences
in string processing code" and in which you don't even know whether
some command involves a fork or is a built-in primitive.

If you've resigned to developing something in the shell, and that something
has to process many items of data, try not to write a shell loop for the
task, and try to avoid idioms which run a process for each item.
Rather, coordinate commands which do the heavy lifting.

If I had to strip a large number of paths to their basenames, and it had
to be done in portable shell code, I would filter those names through sed:
one process invocation and some pipe inter-process I/O.
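
Concretely (a sketch; assumes no newlines in the file names, and printf
being a shell built-in):

   printf '%s\n' dir/*.txt | sed 's,.*/,,'   # one sed for all the names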


I.e. we can use the basename function:

  for name in dir/*txt; do
basename "$name"
  done

prints the basenames of the matching files, one per line.


There is also the GNU variant available for such a command.

   for X in $(basename --suffix=.txt dir/*txt); do my_work $X; done


What you're doing here is destroying the validity of these expanded
paths; the "my_work" command or function cannot access things
through these paths, unless it restores the "dir/" prefix,
which it has not been given as an input.

When you expand dir/*txt, each one of the expansions is a correct
relative path to an object. The stripped basenames aren't.

Whatever "my_work" is doing, if it involves accessing the files,
you're probably making its job more difficult.


But how often can it be avoided to delete extra data like prefixes
(and suffixes)?


Pretty much all of the time.

Can it occasionally be a bit more efficient to provide only the essential
values at the beginning of an algorithm so that they will be
extended on demand?


That sounds like a generic description of the whole body of "lazy"
or "late binding" techniques; but it's unclear how it is supposed
to apply here.

Maybe "my_work" could be given relative paths that resolve; if it needs
shortened names for some reason, let it compute them.

Or "my_work" could be given a quoted pattern:

   my_work '*.txt'

then it can expand it as needed, in whatever directory it wants.





Re: Determination of file lists from selected folders without returning directory names

2017-07-17 Thread Kaz Kylheku (Coreutils)

On 17.07.2017 12:37, SF Markus Elfring wrote:
A corresponding result could be achieved by using a subshell (which I would
like to avoid for this use case) for a command like “(cd ${my_dir} && ls *txt)”.


If you want to capture these strings into a variable, you can't really
avoid a sub-process.


I looked at a programming interface like the function “opendir”.
I imagine that there are more advanced possibilities to improve the software
run time characteristics for this use case.


Well, if it has to be fast, perhaps don't write the code in the shell 
language.


Even an interpreted scripting language that can do string handling without
resorting to fork()-based command substitution will beat the shell at many tasks.


it can be done like this:

   for name in dir/*txt ; do
 echo ${name#dir/}
   done


I would like to avoid such an operation “Remove matching prefix
pattern” generally.
If the desired file lists contain only basenames, extra prefixes do not need
to be deleted.


I.e. we can use the basename function:

  for name in dir/*txt; do
basename "$name"
  done

prints the basenames of the matching files, one per line.



Re: Determination of file lists from selected folders without returning directory names

2017-07-17 Thread Kaz Kylheku (Coreutils)

On 17.07.2017 10:25, SF Markus Elfring wrote:

Hello,

The tool “ls” supports the filtering of directory contents.


No, it actually doesn't. In a shell command like

  ls *.txt

it is actually the shell which performs the *.txt filename expansion,
before the ls program is executed. That program receives all of the
file names as individual arguments, rather than the original pattern.

If you just want the names themselves, you can simply do:

  echo *.txt

One advantage of this is that it avoids the limitations on the
size of the argument vector that can be passed to a child process,
because echo is built-in (if we're talking about GNU Bash, which
is reasonable, given that we're in the GNU Coreutils list.)

Iterating over the matching names is possible using the for syntax:

  for x in *.txt ; do
commands ...
  done

That also avoids limitations on argument passing since it is all
built-in syntax.


A corresponding result could be achieved by using a subshell (which I would
like to avoid for this use case) for a command like “(cd ${my_dir} && ls *txt)”.


If you want to capture these strings into a variable, you can't really
avoid a sub-process. For instance simply doing:

   names=$(echo *txt)

involves a sub-process for the command substitution. Since it is echo,
it could be optimized. I just checked though; bash 4.3.48 is forking
a child process for this. So there is hardly much additional disadvantage
from doing:

   names=$(cd dir; echo *txt)

that just adds a chdir() system call to the child shell.

If the goal is to just dump the names on standard output without the
directory prefix, it can be done like this:

   for name in dir/*txt ; do
 echo ${name#dir/}
   done

this involves no forking of a child process.

To get them on the same line as with echo:

   for name in dir/*txt ; do
 printf "%s " "$name"
   done
   echo # emit final newline




Re: null separated ls output option

2017-06-29 Thread Kaz Kylheku (Coreutils)

On 28.06.2017 23:31, Bernhard Voelker wrote:

[adding findutils]
First of all, find(1) is maintained in the GNU findutils project,
not in GNU coreutils.

Redirected from:
  http://lists.gnu.org/archive/html/coreutils/2017-06/msg00049.html

On 06/28/2017 07:13 PM, Kaz Kylheku (Coreutils) wrote:
> [ snip ... my elaborate proposal for a find -sort predicate ]

I think the GNU toolbox already gives you enough flexibility to support
these edgy use cases, e.g. sorting by file size:

  find . -printf "%s/%p\0" \
| sort -zt '/' -k1,1n  \
| sed -z 's,^.*/,,'


That is great. So there is nothing do here, basically; all the better.

No null-terminated patch for ls is required; GNU sort can sort the
specially formatted null-terminated output from find.

See; it pays to have a discussion about the requirements
before whipping up code.
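
For completeness, the same pipeline keyed on modification time, which was
the original request. A sketch using GNU find/sort/sed; note the prefix
strip is anchored at the first "/" so that slashes inside the paths
survive:

   find . -printf '%T@/%p\0' |
     sort -zt/ -k1,1n |
     sed -z 's,^[^/]*/,,' |
     tr '\0' '\n'    # newline-delimit for display only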




Re: null separated ls output option

2017-06-28 Thread Kaz Kylheku (Coreutils)

On 28.06.2017 06:53, ra...@openmailbox.org wrote:

On 2017-06-01 04:45, Pádraig Brady wrote:

On 31/05/17 15:24, ra...@openmailbox.org wrote:
Existing tools like find(1) were thought sufficient


but find does not support sorting by date which ls does.

I hope this patch can be reconsidered for inclusion.


Rather than the obvious: patching find so that it supports sorting the paths
by date, or other properties?

This could be a "-sort <key>" predicate, which is understood to be like
"-print", except it sends the visited node into a sorting bucket, which
is spilled when find finishes executing.  Sort buckets are identified by
the "<key>" syntax as a key, so multiple occurrences of the predicate
giving the same <key> go to the same bucket. Multiple occurrences of -sort
with different keys route to different buckets; these buckets can be
later dumped in left to right order, based on the position of the
leftmost predicate which specifies each bucket.

<key> could use + and - as prefixes for ascending and descending (defaulting
to + if omitted) followed by a word which is derived from the space of
predicates:

atime, ctime, name, iname, ...

Comma separation for compound keys? -sort mtime,name

Something like that.
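
Purely hypothetical usage, to illustrate the proposal (no such predicate
exists in findutils):

   find /var/log -name '*.log' -sort +mtime,name   # oldest first, ties by name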



Re: [PATCH] env: support encoding of args into command.

2017-05-29 Thread Kaz Kylheku (Coreutils)

On 29.05.2017 04:29, Eric Blake wrote:

On 05/27/2017 07:30 PM, Kaz Kylheku (Coreutils) wrote:



Basically I'm completely against almost every aspect of this
-S design; and I suspect the POSIX standardization people
(Austin Group) won't adopt it, either, so it will forever
remain just a FreeBSD feature (and we can help keep it that
way by not copying it).


The Austin Group has already declared that #! is non-portable, and that
portable scripts can't use it, BECAUSE of the wide variety in how
kernels handle it and the small limits on how much you can cram in that
line.


Gentlemen, please disregard the patch.

I don't care about it any more because I have discovered a
hack which makes it pointless.

With excellent language-level backward compatibility, a given
scripting language interpreter "interp" can provide support
for being invoked in the following manner:

  #!/usr/bin/env interp\000trailing material

Here \000 represents a literal embedded null byte.

So, of course, env receives argv[1] as "interp"
and finds the interpreter properly.  This is the case
whether the kernel stops reading the string after the null,
or whether the kernel passes the character array
"interp\000trailing material" as argv[1] to env.
Either way, env only sees "interp".

The interpreter can then open the script and read the full line,
look for the null byte, and give a meaning to "trailing material".

The interpreter can, in that space, implement the equivalent
of my argument delimiting approach, or the more elaborate one
taken in BSD's env -S.
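
A sketch of what such a script looks like on disk ("interp" is a
placeholder; the od dump just shows that the null byte and the trailing
material really are in the first line):

   printf '#!/usr/bin/env interp\0--strict --verbose\n' > demo
   printf 'script body goes here\n' >> demo
   chmod +x demo
   head -n 1 demo | od -c | head -n 3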

The notation is very space efficient: just one delimiting character
which positively requires no escaping.

It doesn't require adding a second line to the script for encoding
the material, which can change the meaning of existing scripts.

It also potentially defeats limitations on hash bang line size.
Why? Because the only requirement which has to be met is that the
null byte occurs within the header size limit! Not the entire hash
bang line.

The programmer is not relying on the hash bang mechanism to pass
anything after the null byte through the command line, so if any
of it is cut off, that is immaterial.

So far, I have tried this on Darwin, Linux, Solaris and Cygwin:
works fine!

A possible objection is that every interpreter has to implement its
own hack for recognizing the material after the null byte and
doing something. The solution for that, of course, is to provide
a library function for dealing with it: a function which takes
(argc, argv) and the index of the argv[] element which is the script name,
and returns a transformed (argc, argv).

The thing to do is to develop that library function to make
it easy for interpreter writers to just "drop in".







Re: [PATCH] env: support encoding of args into command.

2017-05-27 Thread Kaz Kylheku (Coreutils)

On 27.05.2017 08:06, Pádraig Brady wrote:

Now the FreeBSD env(1) command supports the -S option
to do splitting much as you've proposed, so I see no need to be 
different.

Could you adjust the patch accordingly?


I think it's probably best to avoid copying FreeBSD here. They have
created a mess with too many requirements. Their -S feature
has whitespace delimiting, quoting, escaping of quotes,
C-like character escape sequences: \n, \r, \f, ...
and ${PARAM} environment variable expansion. A little shell
language seems to be brewing inside the FreeBSD env program.
What's it all for?

Yet, for all the "bells and whistles", they are missing the
{} feature to insert the first argument (the name of the script
in hash bang dispatch) among the generated arguments,
something which can't be achieved with any combination of
environment variable substitutions. It is useful because
it allows the hash bang to specify some arguments after the
script (which could be arguments belonging to the script
rather than to the interpreter, for instance).

Let's look at the delimiting requirement. Using spaces
perpetrates a clever visual ruse. The command line:

$ /usr/bin/env -S a b c

translates directly to hash bang:

#!/usr/bin/env -S a b c

the semantics is different, of course; here the string " a b c"
is one argument to -S, subject to splitting. It's
understandable why they have it this way, but I believe
it is fine not to play this sort of trick and make the
mechanism have a visually explicit syntax.

I chose the colon character for a very good reason: it is the
PATH separator. Which means, it doesn't occur in the command
argument in any correct usage of env, and is therefore available
as a separator.

I don't envision a need to support quoting the : character
for inclusion in an argument.

The main reason to include : in an argument is to support
the -P altpath feature of env, another FreeBSD extension.
This is also overdesigned.  If you don't know
where the program is, rely on PATH. If you know where
the program is, put its exact path into the hash bang
script, and don't use the env utility.
The situation "I don't know where the software might be
installed, but it's in one of these
several locations which are not in the PATH" is fairly contrived,
except in one way: when it is known that the software is in
one of the secure, default system locations for programs such
as /bin:/usr/bin, but the PATH could have been altered not
to look in these locations first.

Instead of -P altpath, a single letter option with no argument
can indicate that PATH is to be reset to the secure default,
so that the correct program is found, or else env fails.

Basically I'm completely against almost every aspect of this
-S design; and I suspect the POSIX standardization people
(Austin Group) won't adopt it, either, so it will forever
remain just a FreeBSD feature (and we can help keep it that
way by not copying it).




Re: [PATCH] env: support encoding of args into command.

2017-05-25 Thread Kaz Kylheku (Coreutils)

On 24.05.2017 18:10, Kaz Kylheku (Coreutils) wrote:
This is a new feature which allows the command argument of env to encode
multiple extra arguments, as well as the relocation of the first
trailing argument among those arguments.


Looks like my MUA screwed this up with "format=flowed" and
quoted printable.

I will re-send using the "mail" utility.




[PATCH] env: support encoding of args into command.

2017-05-25 Thread Kaz Kylheku (Coreutils)

This is a new feature which allows the command argument of env to encode
multiple extra arguments, as well as the relocation of the first
trailing argument among those arguments.

* src/env.c (usage): Mention the existence of the feature.
(expand_command_notation): New function.
(main): Detect whether the notation is present, based on the first
character of command. If so, filter the trailing part of the argument
vector through the expand_command_notation function, and use that.
Either way, the effective vector is referenced using the down_argv
variable and that is used for the execvp call.
If an error occurs, the diagnostic refers to the first element of
down_argv rather than the original argv.

* tests/misc/env.sh: Added some test cases. Doesn't probe all the corner
cases. I solemnly declare that I manually tested those corner cases,
like "env :" and "env :{}" and such, and used valgrind for
all the manual testing to be confident that there are no
overruns or uses of uninitialized bytes.

* doc/coreutils.texi: Documented feature. Added discussion about how
env is often used for the hash bang mechanism, and how the feature
relates to this use.
---
 doc/coreutils.texi | 63 +
 src/env.c          | 64 --
 tests/misc/env.sh  | 18 +++++++++
 3 files changed, 143 insertions(+), 2 deletions(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 1834e92..9e1cb0c 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -16879,6 +16879,69 @@ env -u EDITOR PATH=/energy -- e=mc2 bar baz

 @end itemize

+Note that the ability to run commands in a modified environment is built into
+the shell language, using a very similar @samp{@var{variable}=@var{value}}
+syntax; moreover, that syntax allows commands internal to the shell to be run
+in a modified environment, which is not possible with the external
+@command{env}.  Other scripting languages usually also have their own built-in
+mechanisms for manipulating the environment around the execution of a child
+program.  Therefore the external @command{env} executable is rarely needed for
+the purpose of running a command in a modified environment.  Because the
+@command{env} utility uses @env{PATH} to search for @var{command}, it has come
+to be mainly used as a mechanism in "hash bang" scripting. In this usage,
+scripts are written using the incantation @samp{#!/usr/bin/env interp} where
+@var{interp} is the name of some scripting language interpreter. The
+@command{env} utility provides value by searching @env{PATH} for the location
+of the interpreter executable. This allows the interpreter to be installed in
+some chosen location, without that location having to be edited into the hash
+bang scripts which refer to that interpreter.
+
+On some operating systems, the following issue exists: the hash bang
+interpreter mechanism allows only one argument. Therefore, if the @command{env}
+incantation @samp{#!/usr/bin/env interp} is used, it is not possible to pass an
+argument to @samp{interp}, which is a crippling limitation in some
+circumstances requiring clumsy workarounds. To overcome this difficulty, the
+GNU Coreutils version of @command{env} supports a special notation:
+arguments for @var{command} can be embedded in the @var{command} argument
+itself as follows.  If @var{command} begins with the @samp{:} (colon)
+character, then that colon character is removed. The remainder of the
+argument is treated as a record of colon-separated fields, and split
+accordingly. For instance if @var{command} is @samp{:foo:--bar:42}, then
+it is split into the fields @samp{foo}, @samp{--bar} and @samp{42}. The
+effective command is then just @samp{foo}. The other two fields will be
+passed as the first two arguments to @samp{foo}, inserted before the
+remaining @var{args}, if @samp{foo} is successfully found using
+@env{PATH} and executed.
+Furthermore, this special notation supports one more refinement.
+If, after colon splitting, one or more of the fields are
+equal to the character string @samp{@{@}} (open brace, closed brace)
+then the leftmost such field is replaced with the first of the @var{args}
+which follow @var{command}. In this case, that argument is removed from
+@var{args}. If @var{args} is empty, then the field is not replaced.
+
+Example: @command{env} hash bang line for a script executed by the
+fictitious @samp{intercal} interpreter. The @samp{--strict-iso} option
+is passed to the interpreter, and the @samp{--verbose} option is
+passed to the script:
+
+@example
+#!/usr/bin/env :intercal:--strict-iso:@{@}:--verbose
+... script goes here ...
+@end example
+
+When the above hash bang script is invoked with the arguments @samp{alpha} and
+@samp{omega}, @command{env} is invoked with four arguments: the
+argument @samp{:intercal:--strict-iso:@{@}:--verbose}, followed by the path name to the above script itself, follo