Re: thoughts on NO_COLOR
On 2022-02-27 12:37, Pádraig Brady wrote:
> I just noticed some de facto treatment of the NO_COLOR env var.
> https://no-color.org/

These people are not system implementors; they should not be proposing variables in a POSIX-reserved namespace. The website provides no contact links whatsoever; they have walled themselves in Github, and take pull requests as the only means of communication. This is not the way to do things if you want to promote a standard, or to engage with the free software world.

I'm instantly opposed to this NO_COLOR on the above grounds, and think it's an excellent idea to show opposition by making programs call abort() when they find NO_COLOR to be set.

> However, I do think it's a great idea for users who don't want color
> to have a single place to turn it off. I was considering having ls
> --color=auto honor this, but then thought it is not actually needed
> in ls, since we give fine grained control over the colors / styles
> used. For example one might very well always want at least some
> distinguishing of files and directories, with bold / bright etc.
> which can be achieved now with LS_COLORS. Or looking at it another
> way, ls is ubiquitous enough that it's probably already color
> configured as the user desires,

ls is almost certainly color configured the way the user's distro desires, at least initially. Clearly, the variable is aimed at people who don't find things configured as they want by default, and who don't want to go into individual programs or distro scripts to do that, but only flip a single master switch.

> and having ls honor the less fine grained NO_COLOR flag, would result
> in less flexibility.

More knobs to tweak can never result in less flexibility. Were you thinking of "perplexity"? ;)

Speaking of which, I can see someone pulling their hair out trying to get color working, not realizing that something set a NO_COLOR environment variable somewhere.
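For anyone who does end up in that position, a first diagnostic might be something like the following (the list of startup files to grep is only a guess at the usual suspects):

  # is NO_COLOR set in this session at all?
  printenv NO_COLOR && echo "NO_COLOR is set in this session"

  # which startup file might be exporting it?
  grep -n NO_COLOR ~/.profile ~/.bashrc ~/.bash_profile \
      /etc/profile /etc/profile.d/*.sh 2>/dev/null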
Re: how to speed up sort for partially sorted input?
On 2021-08-11 11:58, Peng Yu wrote:
> On Wed, Aug 11, 2021 at 1:43 PM Kaz Kylheku (Coreutils)
> <962-396-1...@kylheku.com> wrote:
>> On 2021-08-11 05:03, Peng Yu wrote:
>>> On Wed, Aug 11, 2021 at 5:29 AM Carl Edquist wrote:
>>>> (With just a bit more work, you can do all your sorting in a
>>>> single awk process too (without piping out to sort), but i think
>>>> you'll still be disappointed with the performance compared to a
>>>> single sort command.)
>>>
>>> Yes, this involves many calls of the coreuils' sort, which is not
>>        ^^^
>> No, not this last remark, which is about "in a single awk process".
>
> I know there is one awk process. I don't understand why you
> mentioned it.

(That's why.)

>>> efficient. Would it make sense to add an option in sort so that
>>> sort can sort a partially sorted input in one shot.
>>
>> IF you're willing to use GNU Coreutils instead of Unix, you
>> probably have
>
> I don't think using awk is efficient. I am program a number awk
> programs for simple transforming the input and tested it, in
> general, it is slower than the equivalent python code, let along C
> code. You can talk about doing most of the work in awk below. I
> don't think that make sense. Having coreutils' sort be able to do a
> partial sort is a more reasonable solution.

The solution doesn't exist today, whereas that Gawk program should run even in ten-year-old installations. For the solution to be useful, it only has to beat the actual sort which you have available today, not some imagined version of sort that isn't yet available.

I'm assuming that you're posting here because you have some real problem to solve, not just to postulate chrome plating for Coreutils, and so that a working program today would be of use to you.

A vast amount of useful computing is being done with tools and approaches that are not thoroughly optimized. Sometimes those approaches usefully prototype a solution which is later optimized or replaced; in the meantime, that solution serves a useful purpose.
Re: how to speed up sort for partially sorted input?
On 2021-08-11 05:03, Peng Yu wrote:
> On Wed, Aug 11, 2021 at 5:29 AM Carl Edquist wrote:
>> (With just a bit more work, you can do all your sorting in a single
>> awk process too (without piping out to sort), but i think you'll
>> still be disappointed with the performance compared to a single
>> sort command.)
>
> Yes, this involves many calls of the coreuils' sort, which is not

No, not this last remark, which is about "in a single awk process".

> efficient. Would it make sense to add an option in sort so that sort
> can sort a partially sorted input in one shot.

IF you're willing to use GNU Coreutils instead of Unix, you probably have GNU Awk also. GNU Awk has a sorting function using which a solution could be cobbled together. Maybe something like:

  function dump_delete_data()
  {
    n = asorti(data, idx);
    for (i = 1; i <= n; i++)
      print data[idx[i]];
    delete data
    serial = 0
  }

  BEGIN { serial = 0 }

  $1 != prev_1 { dump_delete_data() }

  NF >= 2 {
    prev_1 = $1
    data[$2 "." serial++] = $0
    next
  }

  1 {
    dump_delete_data()
    print
  }

  END { dump_delete_data() }

The asorti function has some features behind it to sort in various ways; you have to look into that. It involves manipulating a PROCINFO["sorted_in"] value. It's possible to use a custom comparison function. For more info, see the GNU Awk documentation, the Gawk mailing list or the comp.lang.awk newsgroup.

The purpose of the serial variable in my above code is so that we get two entries in data[] if, in a given group, there are identical $2 values. For instance if $2 is "foo", then the key we use is actually "foo.3" if the current value of serial is 3. The sorting is then done on these suffixed keys, which works okay for lexicographic sorting.

It is not a stable sort, though! Because foo.123 will be sorted before foo.23, even though the 123 serial value comes later. If we padded the integer with enough leading zeros for the largest possible group, it would then be stable: foo.00023 would come before foo.00123:

  data[sprintf("%s.%08X", $2, serial++)] = $0

kind of thing. If you don't care about reproducing duplicates, you can remove this logic entirely.

How the overall program works is that data[] is an array indexed on the second column values (plus serial suffixes). The value of each index value is the entire record, $0. asorti sorts the $2 indices, throwing away the $0 values, which is why we direct it into a secondary array called idx, preserving the data array. The idx array ends up indexed on integer values 1 to N, where N is the chunk size. If we iterate over these values, idx[i] gives us the $2 column values (with serial suffix) in sorted order. We can then use that as the key into data[] to get the corresponding records in sorted order.

Cheers ...
Re: how to speed up sort for partially sorted input?
On 2021-08-10 22:06, Kaz Kylheku (Coreutils) wrote:
> On 2021-08-07 17:46, Peng Yu wrote:
>> Hi, Suppose that I want to sort an input by column 1 and column 2
>> (column 1 is of a higher priority than column 2). The input is
>> already sorted by column 1. Is there a way to speed up the sort
>> (compared with not knowing column 1 is already sorted)? Thanks.
>
> Since you know that column 1 is sorted, it means that a sequential
> scan of the data will reveal chunks that have the same column 1
> value. You just have to read and separate these chunks, and sort
> each one individually by column 2.
>
> GNU Awk has the wherewithal for this sort of thing; it has some
> facilities for sorting associative arrays.

TXR Lisp: structure + function + awk macro:

  (defstruct rec ()
    f1 rec)

  ;; sort list of records by slot f1, and then
  ;; dump their rec slots in order via put-line.
  (defun dump (data)
    (mapdo [chain .rec put-line] (nsort data : .f1)))

  (awk
    ;; two local variables
    (:let f0-prev data)
    ;; if field zero is not same as f0-prev, sort and dump data,
    ;; then set data to nil again
    ((nequal [f 0] f0-prev) (dump data)
                            (set data nil))
    ;; if we have a second field, set f0-prev to that field,
    ;; then capture the record in a structure and push it on the
    ;; list. Go to the next record.
    ([f 1] (set f0-prev [f 0])
           (push (new rec f1 [f 1] rec rec) data)
           (next))
    ;; we don't have a second field: just sort and
    ;; dump the accumulated data, and also print this record.
    (t (dump data)
       (prn))
    ;; end of data: sort and dump accumulated data.
    (:end (dump data)))
Re: how to speed up sort for partially sorted input?
On 2021-08-07 17:46, Peng Yu wrote:
> Hi, Suppose that I want to sort an input by column 1 and column 2
> (column 1 is of a higher priority than column 2). The input is
> already sorted by column 1. Is there a way to speed up the sort
> (compared with not knowing column 1 is already sorted)? Thanks.

Since you know that column 1 is sorted, it means that a sequential scan of the data will reveal chunks that have the same column 1 value. You just have to read and separate these chunks, and sort each one individually by column 2.

GNU Awk has the wherewithal for this sort of thing; it has some facilities for sorting associative arrays. You can scan records and aggregate them while column 1 is the same, then do some sorting and output (also at the end of the file).

Good luck!
Re: Suggest on "ln"
On 2021-07-19 00:50, Patrick Reader wrote:
> On 19/07/2021 08:48, Kamil Dudka wrote:
>> On Monday, July 19, 2021 2:29:18 AM CEST James Lu wrote:
>>> "ln" should write a warning to stderr if the source file doesn't
>>> exist.
>>
>> ln writes an error message to stderr if the source file does not
>> exist:
>>
>>   $ mkdir new-dir
>>   $ cd new-dir
>>   $ ln does-not-exist target
>>   ln: failed to access 'does-not-exist': No such file or directory
>
> I'm guessing they meant `ln -s`.

Symbolic links with nonexistent targets are legitimate and useful.

They can be used to stash information that isn't intended to be a pointer to an object in the file system at all:

  ln -sf "" hash

A symlink is essentially a tiny text file where you can store info, subject to some easy-to-meet restrictions.

Dangling links can be prepared in a file system structure that will be installed somewhere, where the links will resolve:

  ln -sf /etc/alternatives/netcat $(DESTDIR)/bin/netcat

An option to emit a warning could be mildly useful, but it's nothing you can't check yourself *after* the symlink is made:

  ln -sf $TARGET $LINK   # quoting elided for brevity
  [ -e $LINK ] || printf "warning: link target %s doesn't exist\n" $TARGET

Doing the check before the link is made is more involved because a relative link target is resolved relative to the link's location. Plus if it is buggy, then it won't match what the operating system says; the ultimate arbiter of whether the link is dangling is to create it and actually test it.
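A short session demonstrating that resolution rule, and the create-then-test approach:

  $ mkdir dir
  $ ln -s target dir/link    # "target" resolves against dir/, not $PWD
  $ [ -e dir/link ] || echo dangling
  dangling
  $ touch dir/target         # now the very same link resolves
  $ [ -e dir/link ] && echo resolves
  resolves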
Re: [PATCH] copy: disallow copy_file_range() on Linux kernels before 5.3
On 2021-05-12 16:09, Pádraig Brady wrote:
> copy_file_range() before Linux kernel release 5.3 had many issues,

Remark: although there is nothing wrong with the patch, and it is necessary, this seems like an issue for the C library to handle, as well.

- The GNU C library provides the function copy_file_range. The fact that this is a Linux kernel feature is abstracted by the library.

- The GNU C library knows what version of the kernel it is running on, and provides workarounds in relation to that (or possibly refuses to run at all).

So, arguably, the responsibility of somehow working around this lies with glibc; glibc is the purveyor of the API. Unfortunately, I don't see any discussions about the issue in the libc-alpha mailing list. The situation is not acceptable that a GNU program is working around broken GNU library functions, with no action being taken in the GNU library (even if that program will need to carry those workarounds anyway).

The most recent commits which mention copy_file_range in the commit message are:

  2020-03-03  Florian Weimer  Linux: copy_file_range syscall number is always available
  2019-07-08  DJ Delorie      NEWS: clarify copy_file_range
  2019-06-28  Florian Weimer  io: Remove copy_file_range emulation [BZ #24744]

(I am obviously assuming that any commit related to this issue will have "copy_file_range" in the commit message.)

The comment for that third one is interesting:

  The kernel is evolving this interface (e.g., removal of the
  restriction on cross-device copies), and keeping up with that is
  difficult. Applications which need the function should run kernels
  which support the system call instead of relying on the imperfect
  glibc emulation.

Glibc should not be providing a function like this at all until it stabilizes. Programs wanting to use a bleeding-edge kernel call should use some syscall macro to generate it themselves.

It sounds as if, now that it may have stabilized, glibc should be offering it under a different name/alias. Programs that need a reliable copy_file_range can then just refer to that new name which indicates reliable semantics. They detect that name in their configure scripts, and build and link against that name. Those programs will then refuse to run against a glibc which doesn't export that name. Problem solved.
Re: [PATCH] ls: add --sort=width (-W) option to sort by filename width
On 2021-04-09 15:51, Pádraig Brady wrote:
> On 09/04/2021 13:02, Carl Edquist wrote:
>> Dear Coreutils Maintainers,
>>
>> I'd like to introduce my favorite 'ls' option, '-W', which I have
>> been enjoying using regularly over the last few years. The concept
>> is just to sort filenames by their printed widths. (If this sounds
>> odd, I invite you hear it out, try and see for yourself!) I am
>> including a patch with my implementation and accompanying tests -
>> as well as some sample output. And I'll happily field any requests
>> for improvements.
>
> I quite like this. It seems useful.
> Also doing outside of ls is quite awkward,
> especially considering multi column output.

Ah, but not so! What is awkward is doing the sorting outside of ls, using only the shell and utilities. The multi column output can be done by feeding the sorted list of files to ls, with the -df options (don't list directories, don't sort). Demo:

  ls -f | gawk -f sizesort.awk

  [multi-column listing of a source tree, shortest names such as ID
  first, longest names such as ChangeLog-2009-2015 last]

Source code of sizesort.awk (which uses GNU Awk extensions):

  #!/usr/bin/awk -f

  function compare(ia, a, ib, b)
  {
    return length(a) - length(b)
  }

  { dir[NR] = $0 }

  END {
    asort(dir, sdir, "compare")
    for (x in sdir) {
      print sdir[x] | "xargs ls -fd"
    }
  }

But this doesn't handle arbitrary file names. However, Awk can process null terminated/separated records, as put out by find -print0. Hold my beer:

  $ find . -print0 | awk -v RS='\0' '{print$1}' | head
  .
  ./rand.c
  ./args.h
  ./termios.h
  ./combi.h
  ./rand.h
  ./gencadr.txr
  ./unwind.h
  ./txr.1
  ./termios.c

Proof of concept with sorting:

  $ find . -maxdepth 1 -print0 | awk -rf sizesort0.awk

  [the same names again, this time sorted shortest first, as
  multi-column output from xargs -0 ls -fd]

sizesort0.awk:

  function compare(ia, a, ib, b)
  {
    return length(a) - length(b)
  }

  { dir[NR] = $0 }

  BEGIN { RS = "\0" }

  END {
    asort(dir, sdir, "compare")
    for (x in sdir) {
      printf "%s\0", sdir[x] | "xargs -0 ls -fd"
    }
  }

What we could use here is a "ls -0" option that is like "ls -1" but with null termination. And likewise some option to have ls read file names from standard input. Line-wise by default, or null-terminated if -0 is specified.

So easy in a language with more well-rounded functionality:

  1> (run "ls" (cons "-fd" [sort (get-line
Re: version-sort ugliness or bugs
On 2021-04-15 18:44, Erik Auerswald wrote:
> Hi,
>
> On Thu, Apr 15, 2021 at 11:47:34PM +0200, Vincent Lefevre wrote:
>> I'm currently using version-sort in order to get integers sorted in
>> strings (due to the lack of simple numeric sort like in zsh), but
>> I've noticed some ugliness. This may be bugs, but I'm not sure
>> [ ... ]
>
> I think all of your problems ("ugliness") are caused by the concept
> of "file extensions" in GNU Coreutils version sort.
>
> https://www.gnu.org/software/coreutils/manual/coreutils.html#Special-handling-of-file-extensions

That strikes me as a very poor set of requirements. The treatment of suffixes is extremely hacky, and unnecessary.

Here is an algorithm + implementation I hacked up in 15 minutes. Here is the informal spec. Note that it makes no mention of special case hacks for suffixes, yet suffixes end up treated reasonably:

1. A string is parsed into tokens. There are three kinds of tokens:
   - DOT: (".")
   - INT: decimal string (e.g. "123")
   - TXT: sequence of other characters

2. INT tokens are converted to integer values.

3. The token sequence is parsed in order to shore up INT DOT INT { DOT INT }* ... sequences into (INT INT ...) lists.

4. Any other INT token not placed into a list is turned into a list of one integer (INT).

Then, the resulting sequence is compared as follows:

- TXT-TXT comparisons are ordinary lexicographic
- LIST-LIST comparisons are lexicographic on the list of integers
- Otherwise, the sorting order is DOT < TXT < LIST

Sample implementation in TXR Lisp. Note: to achieve DOT < TXT, we replace "." tokens with the character object #\. The TXR Lisp less function then takes care of it: (less #\a "a" '(1 2 3)) -> t

Run:

  $ txr versort.tl
  abc.txt
  abc-1d.2c.tar.gz
  abc-1.2.tar.gz
  abc-1.2c.tar.gz
  abc-1.2.3.tar.gz
  abc-1.2.3-3.14.tar.gz
  abc-1.2.3-4.5.tar.gz
  abc-1.2.3-9.tar.gz
  abc-1.2.3-9.tgz
  abc-1.2.3-9-sig.bin
  abc-1.2.3.3.14.tar.gz
  abc-2-tar.gz
  abc-11-tar.gz
  foo.txt
  zzz-3.0
  zzz-4.0
  zzz-xyz-4.5
  zzz-xyz-9.15.3

abc-1d.2c.tar.gz is before abc-1.2 because d is not part of the version number. This is a case of version 1 coming before 1.2. (Don't have trailing junk in your version numbers, except possibly at the very end; keep them numeric!)

Code in versort.tl:

  (defun ver-tok (str)
    (tok #/\.|\d+|[^\d.]+/ str))

  (defun ver-parse (str)
    (let ((all-toks (ver-tok str)))
      (labels ((convert (toks)
                 (mapcar [iffi (fr^ #/[0-9]/) toint] toks))
               (parse (:match)
                 (((@(integerp @a) "." @(integerp @b) . @rest))
                  (parse (cons (list a b) rest)))
                 (((@(integerp @a) . @rest))
                  (parse (cons (list a) rest)))
                 (((@(listp @a) "." @(integerp @b) . @rest))
                  (parse (cons (append (flatten a) (list b)) rest)))
                 ((("." . @rest)) (cons #\. (parse rest)))
                 (((@a . @rest)) (cons a (parse rest)))
                 ((@else) else)))
        (parse (convert all-toks)))))

  (defun ver-recombine (vsyntax)
    (cat-str (mapcar [iffi consp [chain (op mapcar tostring)
                                        (ap join-with ".")]]
                     vsyntax)))

  (defun ver-sort (strings)
    [mapcar ver-recombine (sort [mapcar ver-parse strings])])

  (let ((data '("abc-1.2.3.tar.gz" "zzz-4.0" "abc-11-tar.gz"
                "abc-2-tar.gz" "abc-1d.2c.tar.gz" "abc-1.2c.tar.gz"
                "abc-1.2.3-9-sig.bin" "abc-1.2.tar.gz"
                "abc-1.2.3-9.tar.gz" "abc-1.2.3-3.14.tar.gz"
                "abc-1.2.3.3.14.tar.gz" "abc-1.2.3-9.tgz" "zzz-3.0"
                "foo.txt" "abc.txt" "abc-1.2.3-4.5.tar.gz"
                "zzz-xyz-9.15.3" "zzz-xyz-4.5")))
    (tprint (ver-sort data)))
Re: [PATCH] cksum: Use pclmul hardware instruction for CRC32 calculation
On 2021-03-14 12:55, Jeffrey Walton wrote:
> The underlying problem is GCC, Clang and friends conflate the user's
> ISA with the ISA the compiler uses. They are not the same - they are
> distinct. Unfortunately, GCC and Clang never addressed the
> underlying problem.

Sorry, what does that mean? GCC works fine as a cross-compiler. E.g. built to run on the x86_64 ISA, but putting out Aarch64 code.

The "Submodel" options of GCC are determined by the configuration: how that GCC was built. On x86 (and maybe others), there is a "native" argument for -march and -mtune, as in -march=native. If GCC is invoked that way, it will generate code according to the processor of the machine it is running on. (Unless, I'm guessing, it's built as a cross-compiler, so the build machine's architecture is irrelevant.)
Re: [PATCH] cksum: Use pclmul hardware instruction for CRC32 calculation
On 2021-03-12 07:33, Kristoffer Brånemyr via GNU coreutils General Discussion wrote:
> Hi, I was just wondering if you are planning to merge the change, or
> if you decided against it? :) I wanted to use the cpuid.h autoconf
> detection for another patch I'm working on.

Regarding the comment:

> Since the time the process spends waiting on syscalls (fread) is
> still the same, actual real time speedup is only 3x. It would be an
> interesting exercise to try to use async IO, so you could checksum
> one block while reading the next. Maybe I will try that one day.

You never know, but probably not. If the 3x performance was achieved with a hot cache, then async I/O probably isn't going to do anything, since everything is in RAM already. When the cache is pre-loaded, the I/O syscalls are pure CPU overhead, since nothing is waiting on any real I/O.

I would try these improvements, in order:

- Don't use stdio fread, which is an extra layer of calls and buffering over read. Use read, and play with different buffer sizes.

- Use mmap to map the file to memory, and then crc32 that buffer.

In the non-hot-cache case where async I/O might help, you can likewise get a potential improvement with mmap by using madvise with MADV_SEQUENTIAL to give it a hint that you're performing sequential access (which benefits from reading ahead).
Re: Add dry-run option to mv
On 2021-03-10 13:59, L A Walsh wrote:
> On 2021/03/07 03:20, Emilio Garcia wrote:
>> Hi all, I checked out the coreutils repo on Github and I would like
>> to ask you to add a dry-run option in the mv command.
>
> When I've needed such functionality, I insert an 'echo' before the
> 'mv' command, so in a script:
>
>   cmd=eval
>   if ((dry_run)); then
>       cmd=echo
>   fi

Me too; but that doesn't validate the arguments like Emilio wants. E.g.

  mv --dry-run existing-file nonexistent-dir/   # error
  mv --dry-run nonexistent-object somewhere     # error
  mv --dry-run object /another/filesystem       # diagnostic
Re: Sorting SNMP numeric OID's?
On 2021-02-22 07:31, Ed Fair via GNU coreutils General Discussion wrote: Has it ever been discussed to add an option to the sort utility for sorting numeric SNMP object identifiers by sub-identifier? Probably not, but what has been discussed is sorting version numbers like 1.2.3. How are SNMP OIDs different from version numbers for the purposes of sorting? GNU Coreutils sort supports -V/--version sort. Do you know about it, and have you tried it? What requirements in relation to SNMP OIDs are not met by this feature?
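For what it's worth, a quick test suggests -V already orders dotted numeric strings by sub-identifier value, which is what OID sorting would seem to require:

  $ printf '1.3.6.1.2\n1.3.6.1.10\n1.3.6.1.2.1\n' | sort -V
  1.3.6.1.2
  1.3.6.1.2.1
  1.3.6.1.10

A plain lexicographic sort would have put 1.3.6.1.10 before 1.3.6.1.2.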
Re: chmod: man page clarification for letter X
On 2020-10-08 10:28, Tomás Fernandes wrote:
> Hello, I've recently come across chmod's man page as someone who is
> not very experienced (1st year CS undergrad), and found the
> definition of the letter X in the man page a bit unclear, more
> specifically this part (in bold):

On the topic of chmod documentation, it could use a clarification in the following matter.

chmod supports a = operator for copying permissions. For instance u=g means "make the u permissions be like g". chmod also supports multiple operations, like u=g,g=o

The behavior of GNU Coreutils chmod is that the = operator samples the *new* value of the permissions (everything to the left has already taken place). This is true even without the comma separation, when = is combined in one clause with other operators, as in:

  o+x=o

Here, the o+x will apply the x permission to o. Then this effect is settled and the =o assignment therefore has no effect; it's the same as:

  o+x,o=o

Or something like that; I've not looked at this stuff in a while, but it was one of the issues I ran into when making a chmod implementation. It would be good if the documentation spelled it out that = references the new permissions which result from everything to the left of = having been processed.
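A session along these lines would demonstrate the left-to-right behavior (assuming chmod works as I described above; the comments are mine):

  $ touch f
  $ chmod 604 f        # u=rw-, g=---, o=r--
  $ chmod g+w,u=g f    # u=g copies the *new* g (-w-), not the old (---)
  $ ls -l f
  --w--w-r-- 1 ...

If = sampled the permissions as they stood before the whole command, u would have ended up with no permissions at all.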
Re: chmod: man page clarification for letter X
On 2020-10-08 10:28, Tomás Fernandes wrote:
> Hello, I've recently come across chmod's man page as someone who is
> not very experienced (1st year CS undergrad), and found the
> definition of the letter X in the man page a bit unclear, more
> specifically this part (in bold):
>
>   execute/search only if the file is a directory or already has
>   execute permission *for some user* (X)
>
> In my opinion this could be worded slightly better and more clearly
> as:
>
>   execute/search only if the file is a directory or already has
>   execute permission *for user, group or other* (X)

How about "X behaves like x if the object already has any execute permission bit set, otherwise it has no effect."
Re: wc feature request
On 2020-10-05 08:40, A B wrote:
> Many thanks for all the much needed contributions to society as a
> whole. I did have one feature to request for wc, which I think would
> be highly complementary to grep’s -q flag. It would be really cool
> if wc could have a -q flag as well, which could return matches
> within a predefined threshold as the exit code itself. So for
> example, if I wrote ‘wc -l -q’ at the end of a pipe, then no output
> would be returned, but the exit code would return a 3 if three lines
> were found.

I don't see this exact feature in the documentation of GNU grep. grep terminates with a 0 status (success) when matches are found, and this is true with -q.

The idea has limited applicability; there are only as few as 8 bits (or fewer?) available in the process status word for encoding the exit code. It could be useful for counting the number of lines or characters in files that are somehow guaranteed to be small.

The inversion of the exit success polarity is also troubling. If nothing is counted, that's 0 (success), whereas if anything is counted, that is a termination failure.
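The 8-bit limitation is easy to demonstrate, as is the usual way of getting a count into a test without any new wc option ("somefile" is a stand-in):

  $ (exit 300); echo $?    # only the low 8 bits survive
  44
  $ [ "$(wc -l < somefile)" -eq 3 ] && echo "exactly three lines"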
Re: What is the interpretation of bs of dd in terms of predicting the disk performance of other I/O bound programs?
On 2020-09-23 09:56, Peng Yu wrote:
> Hi, Many people use dd to test disk performance. There is a key
> option of dd, bs, and I understand what it literally means. But it
> is not clear how the performance measured by dd using a specific bs
> maps to the disk performance of other I/O bound programs. Could
> anybody let me know the interpretation of bs in terms of predicting
> the performance of other I/O bound programs? Thanks.

The bs likely maps to performance like this:

  perf
  (fraction of max)
  1.0 |       _________________
      |     _/                 \_
      |    /                     \
      |   /
      |  |
      | /
    0 ||
      +-|------------------|------ bs
        A                  B

A bs of zero is impossible, so we can call that point "no performance". Ridiculously small values of bs will cause the program to be doing too many system calls. The larger the bs, the fewer syscalls dd has to make, so there is some improvement with diminishing returns until the maximum theoretical performance is reached for that OS, hardware and approach (read/write loop). Then if bs gets ridiculously large, so that the buffers don't fit into the on-chip CPU caches, there are almost certainly negative returns.

The range of sizes from A to B is probably wide enough that an intelligent guess at a good bs size is likely to land in it.
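The left side of that curve (syscall overhead) is easy to observe without involving a real disk; a rough sweep like this uses pure /dev/zero to /dev/null traffic, so it measures only the per-call cost, not device behavior:

  for bs in 512 4096 65536 1048576; do
    echo "bs=$bs:"
    dd if=/dev/zero of=/dev/null bs=$bs count=$((1073741824 / bs)) 2>&1 | tail -n 1
  done

Each pass copies the same 1 GiB, so the throughput line that GNU dd prints on stderr shows the effect of the block size directly.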
Re: date: unclosed date string comments
On 2020-08-05 18:52, sunnycemet...@gmail.com wrote:
> Hello. Given this documentation:
>
>   Comments may be introduced between round parentheses, as long as
>   included parentheses are properly nested.
>
> Is this considered a bug:
>
>   ■ LC_ALL=C date -d '(test 1 2 3'
>   Wed Aug  5 00:00:00 EDT 2020
>   ■ LC_ALL=C date -d '((test 1 2 3)'
>   Wed Aug  5 00:00:00 EDT 2020
>
> Thank you.

This is a lack of diagnosis that adds up to it being a feature: a comment which does not close is closed implicitly by the end of the string. Once this kind of thing escapes into the wild, the safest thing is to document it. A fix for this (like making date exit with a diagnostic and failed termination status) will break something for someone somewhere.

Unix is historically awful for this sort of looseness. For instance an extra closing parenthesis in an extended regex is treated as literal. Too late to fix, this had to be enshrined in POSIX scripture:

  The <right-parenthesis> shall be special when matched with a
  preceding <left-parenthesis>, both outside a bracket expression.

Particularly in the earlier history of Unix, a lot of it was geared toward getting the happy cases working with the least amount of code. At least the C compilers grew up quite a bit. You can fix looseness in compilers more easily, because when you tighten something in a compiler, the diagnostic blows up in the face of a developer. If you tighten something in a utility, something breaks in the field, because it is a run-time check.
Re: Enhancement Request for sha256sum - output only the SHA-256 hash alone
On 2020-07-17 14:33, Pádraig Brady wrote:
> On 17/07/2020 15:21, jens.archlinux jens wrote:
>> Hello, propose to add a new option for sha256sum to output only the
>> SHA-256 hash alone, without a trailing file name and without a
>> trailing newline. (This would make sense for one input file only).
>> It would make shell scripts that use sha256sum much simpler.
>> Currently it is necessary to split the output of sha256sum to
>> obtain the hash, which usually requires an additional command /
>> Unix process.
>
> This is one of those trade-offs. I'd be 60:40 against adding such an
> option, because it's so easy to implement with cut(1):

Can I muse about checksum utility design?

Someone once, who didn't understand Unix utility design principles, had the dumb idea of polluting the output of a checksumming utility with extraneous information. Somehow that became a meme for authors of new checksumming utilities, though not so rigid a meme that they would actually make those outputs compatible with their predecessors.

Maybe it was the same person who thought it's a good idea for "dd" to output, by default, cruft like:

  0+0 records in
  0+0 records out
  0 bytes (0 B) copied, 0.726321 s, 0.0 kB/s

Did that person ever work at Microsoft on MS-DOS? It's suspiciously reminiscent of:

  C:\Users\kaz>copy foo.txt bar.txt
          1 file(s) copied.

(Thank you; I would never be able to guess that one file was copied from the fact that I specified one file, and the command's termination status was successful).

I'm against adding the option for this reason: the default behavior of a checksum function should be to output nothing but the checksum.

Note that the word "sum" is redundant in "sha256sum". Thus there is an opportunity for a "sha256" utility which just outputs nothing but the sum. That utility could be sha256sum itself, upon detecting that argv[0] ends in "sha256", though that is risque.

Also, that utility should perhaps calculate a continued sum when given multiple arguments, and not individual sums. So that is to say:

  sha256 a b c
  sha256 <(cat a b c)

should be the same.

Now let's talk options. It should have two, -i and -f:

  sha256 -i <state> [ inputs ... ]

would calculate the hashes over the inputs, starting with the specified state. The special token of 0 (the ASCII zero digit) would mean "the initial state".

In the -i mode, sha256 would output a string (in an unspecified, opaque format, perhaps inspired by "stty -g") which encodes the newly updated state. The string should have no quoting or escaping issues for shell programming. The output of sha256 -i would be suitable as an argument to the -i option of a new command, to continue the hashing operation over additional inputs. It would also be suitable as an argument to -f, so that:

  sha256 -f <state> [ inputs ... ]

would process inputs (if any) just like sha256 -i <state>, and then do the hash finalization, and output not another state cookie, but the final hash. Thus, the output of sha256 a b c could also be obtained using:

  st=$(sha256 -i 0)
  st=$(sha256 -i $st a)
  st=$(sha256 -i $st b)
  st=$(sha256 -i $st c)
  sha256 -f $st

or:

  st=$(sha256 -i 0 a b c)
  sha256 -f $st

or:

  st=$(sha256 -i 0)
  sha256 -f $st a b c

or, "point-free" application:

  sha256 -f $(sha256 -i 0 a b c)

etc.

I would add one more option: -s (literal string, not file name). Whenever one or more -s options are present, their argument values are pulled into the hash, in the order they appear, before any files. Thus:

  $ sha256 -s coreutils
  3993c379c029014a9c4b2adf5d23397b3c7421467a0cb3575ff925bb6f6329b0
  $ sha256 -s core -s utils
  3993c379c029014a9c4b2adf5d23397b3c7421467a0cb3575ff925bb6f6329b0
  $ sha256 -f $(sha256 -i 0 -s core -s utils)
  3993c379c029014a9c4b2adf5d23397b3c7421467a0cb3575ff925bb6f6329b0

-i and -f are mutually exclusive, and must precede any -s options.
Re: mv w/mkdir -p of destination
On 2020-07-03 14:38, Bernhard Voelker wrote:
> On 2020-05-11 05:16, Vito Caputo wrote:
>> Does this already exist? Was just moving a .tgz into a deep path
>> and realized I hadn't created it on that host, and lamented not
>> knowing what convenient flag I could toss on the end of the typed
>> command to make `mv` do the mkdir -p first for me. I was surprised
>> to not see it in mv --help or mv(1) when I checked for next time...
>
> mv(1) is ... well, for moving files and directories.

If we consider a filesystem to be a collection containing a name space which assigns path names to objects, then mv is a tool for re-assigning a new path name to an object.

Years ago I implemented this concept in a version control system. It has a mv command which works regardless of whether directories exist. (In fact instead of using the mv command, you can edit the representation of the directory structure, and then run an update command to re-shape the workspace accordingly; mv works by doing the same thing.) However, the tool did not store a representation of directories at all. Just files and symbolic links. If a directory-restructuring operation renames all the files out of a directory, that directory is removed (unless it contains local, untracked content). Its parent is removed if it becomes empty and so on. It works beautifully and is intuitive to use.

> And creating the destination directory in the same go seems to be a
> seldom case. We see also from the missing answers so far, that
> nobody seems to be much excited about this feature. Anyway, as it's
> very easy to work around it with a separate mkdir(1)

Not to mention rmdir! If you have moved all content out of a directory, you may not want it to exist. For symmetry "mv -p" should remove all empty directories left behind, as far up the tree as possible and as permissions will allow.

> it's better to avoid adding complexity to the code.

Of course, the complexity doesn't go away; i.e. it stays with the user to grapple with. Though forty years of Unix users don't seem to have minded all the various inconvenience, so why bother.
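For completeness, the workaround in both directions fits in a small shell function (a sketch; "mvp" is a made-up name, it only handles the two-argument form of mv, and it assumes the source isn't in the current directory):

  mvp() {
      mkdir -p -- "$(dirname -- "$2")" &&
      mv -- "$1" "$2" &&
      rmdir -p --ignore-fail-on-non-empty -- "$(dirname -- "$1")"
  }

The rmdir invocation is the GNU one; it implements exactly the "remove empties as far up as permissions allow" symmetry described above.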
Re: Disable b2sum from coreutils?
On 2020-07-01 22:18, Jeffrey Walton wrote:
> Hi Everyone, The BLAKE2 folks have optimized implementations for
> b2sum on i686, x86_64, NEON and PowerPC. It also has more options
> than the coreutils version. I'd like to disable b2sum in coreutils
> and use the BLAKE2 team's version.

This is a job for your open source system distribution.

The approach taken on some distros is to build every package normally. If two or more packages provide the same executable, there is a mechanism in place to choose which one is installed. It may be that all are installed, but under an altered name like say "/usr/bin/b2sum.coreutils" and "/usr/bin/b2sum.blake2". Then the resolution system chooses one of these as the target of a /usr/bin/b2sum symbolic link.

The renaming and symlinking are done outside of the build systems of the programs; they are arranged by the distro build system. The distro build system redirects the "make install" of a package into a temporary install directory which is for that package only. Then the installed materials are further manipulated. For instance, the materials may be split into development, run-time and documentation parts, which become separate packages. It is at this stage that clashing executables might be renamed.

The virtue of this system is that the end user has a way to choose which binary dominates, without the upstream packages having to be rebuilt; all the packages have all the binaries. I think pretty much any major distro has a way to do this; find out how yours is doing it.

If you're building your own local package from sources (like say blake2) and would like its b2sum to be used instead of the one in /usr/bin, then simply make sure that /usr/local/bin is ahead of /usr/bin in your PATH.
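On Debian-family systems, for instance, that mechanism is update-alternatives; assuming the two renamed binaries exist as above, registering them would look something like:

  # update-alternatives --install /usr/bin/b2sum b2sum /usr/bin/b2sum.blake2 100
  # update-alternatives --install /usr/bin/b2sum b2sum /usr/bin/b2sum.coreutils 50
  # update-alternatives --config b2sum    # interactively pick the winner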
Re: feature request: better intuitive syntax LINK=TARGET
On 2020-06-24 19:35, Andrej Surkov wrote:
> Hi all! ln syntax is very uncertain - every time I use ln I'm
> confused what is correct: "ln -s LINK TARGET" or "ln -s TARGET
> LINK"! Of course I can check man ln or ln --help, but what if we add
> unambiguous syntax, for example ln -s LINK=TARGET

  mv existing new
  cp existing new
  ln existing new
  ln -s rel-or-abs-path new

The confusing thing in the ln -s case is that if the path is relative, it is resolved with respect to the directory of new, not the current directory where the command is executing.

I suspect that this altered semantics of the source argument is actually the root cause of people becoming confused about the order of the arguments. Since the rel-or-abs-path often isn't a working path from the current directory to the desired link target, but simply specifies the content of the link, your brain thinks of the operation as a variable assignment: stuffing the specified link with the given literal value.

I think this is what is fixed by the -r option of the GNU Coreutils ln. If you use -r, then it's just:

  ln -sr orig new   # like plain ln, cp or mv

With -r, if orig is relative, it is understood relative to the current working directory, not to the link's directory. If the object exists, then orig is the actual path from here to that object. The orig and new paths are canonicalized, and then the relative path R from $(dirname new-canon) to orig-canon is calculated. Then the link is created as if by ln -s R new.

I suspect if you start using "ln -sr original-item link-to-it", and no longer think of the operation as stuffing a literally specified piece of content into the variable-like link object, but as creating a virtual copy of original-item named link-to-it, the recurrent confusion may be cured.
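A short session showing the path that -r computes:

  $ mkdir -p a/b
  $ touch a/orig
  $ ln -sr a/orig a/b/new   # both paths relative to the current directory
  $ readlink a/b/new
  ../orig

The stored target is the relative path from a/b (the link's directory) to a/orig; that is, exactly the R described above.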
Re: [PATCH] md5sum: add an option to change directory
On 2020-05-20 14:15, Bertrand Jacquin wrote:
> In the fashion of make and git, add the ability for all sum tools to
> change directory before reading a file.
>
>   $ sha256sum /etc/fstab
>   b5d6c0e5e6bc419b134478ad7b3e7c8cc628049876a7772cea469e81e4b0e0e5  /etc/fstab

Make requires this option because it reads a Makefile, and those often contain relative references which assume that the Makefile's directory is current. The inputs to md5sum don't contain path references that break.

In other regards, every tool that does anything with files could have a -C option:

  Copying files:       cp -C /etc fstab fstab.bak
  Executing a script:  sh -C /etc rc.local
  Editing:             vi -C /etc fstab

Where does it end?

>   $ sha256sum -C /etc fstab
>   b5d6c0e5e6bc419b134478ad7b3e7c8cc628049876a7772cea469e81e4b0e0e5  fstab

The net effect is that just the output has changed to omit the path name. Maybe this wants to be a --strip or -p option like with diff or patch, or --basename-only to strip a variable number of components, leaving only the last. If I want to print a simplified name, I don't want to do this:

  md5_short() {
    local dir=$(dirname "$1")
    local base=$(basename "$1")
    md5sum -C "$dir" "$base"
  }

  md5_short /path/to/whatever

I just want this:

  md5sum --basename /path/to/whatever

The -C functionality can easily be done with subshells, or with a chdir() after fork(), before exec(). In a script, instead of "make -C dir", you can always do (cd dir; exec make). In C, make yourself a spawning function that has the dir-changing functionality built in:

  spawn_program(path_to_executable, /* or let PATH be searched */
                change_to_this_directory,
                use_these_args,
                these_env_vars);
Re: suggestion: /etc/dd.conf
On 2020-04-28 02:14, turgut kalfaoğlu wrote:
> I would like to suggest and in fact volunteer to create a conf file
> option to 'dd'.

By doing that you're replacing function arguments with global variables, which is a bad idea.

> It has dozens of hard to remember options, and there are some that I
> would like to use all the time.

Look into shell functions and aliases.

> For example, I am currently doing:
>
>   $ sudo dd if=CentOS-6.10-x86_64-LiveDVD.iso of=/dev/sdc bs=4096 conv=fsync
>
> right now, and I have to lookup the conv=fsync option every time I
> want to write to a USB drive.

It's unlikely that this option is required; have you tried it without? To make sure any buffered writes are flushed, do a "sync" after the entire dd operation.
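A shell function captures the whole incantation once, per user rather than as a global configuration knob; a sketch for ~/.bashrc ("writeusb" is a name I just made up):

  writeusb() {
      sudo dd if="$1" of="$2" bs=4096 conv=fsync status=progress
  }

  # usage:
  writeusb CentOS-6.10-x86_64-LiveDVD.iso /dev/sdc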
Re: statically linking coreutils 8.32
On 2020-03-19 01:54, Gabor Z. Papp wrote: lo lo, while trying to statically link coreutils 8.32 on linux x86_64, I'm getting the following error: Static linking has not been supported by Glibc for many years now; so you can at best get a program's own components to be static, but not down to fully static executable with linked-in libc. For static executables, you have to use a C library that supports static linking: musl or whatever.
Re: failing CI jobs
On 2020-03-18 04:27, "Toni Uhlig (Smartphone)" via GNU coreutils General Discussion wrote: There are a lot of failing CI jobs and nobody seems to care about. Some of them seem to fail since two+ years ago. Why not disable them, if nobody cares about? Source: https://hydra.nixos.org/job/gnu/coreutils-master Firstly, this specific page is not found; 404 error. Secondly, more generally, surely NixOS is not hosting CI for GNU Coreutils development?
Re: altchars for base64
On 2020-03-15 09:00, Assaf Gordon wrote:
> Hello,
>
> On 2020-03-15 12:12 a.m., Kaz Kylheku (Coreutils) wrote:
>> On 2020-03-14 22:20, Peng Yu wrote:
>>> Python base64 decoder has the altchars option. [...] But I don't
>>> see such an option in coreutils' base64. Can this option be added?
>>> Thanks.
>>
>>   # use %* instead of +/:
>>   base64 whatever | tr '+/' '%*'
>
> The reason for alternative characters is typically to use them in
> URLs, where "/" and "+" are problematic.
>
> A new command "basenc" was introduced in coreutils version 8.31
> (released last year) which supports multiple encodings.

If your script has to work in installations that aren't up to Coreutils 8.31, or don't use Coreutils at all (base64 comes from somewhere else), you need the tr trick or its ilk.
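Where basenc is available, its --base64url encoding covers the URL case directly; a demonstration with two bytes chosen to hit both substituted characters:

  $ printf '\373\377' | base64
  +/8=
  $ printf '\373\377' | basenc --base64url
  -_8=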
Re: altchars for base64
On 2020-03-14 22:20, Peng Yu wrote:
> Hi, Python base64 decoder has the altchars option.
>
> https://docs.python.org/3/library/base64.html
>
>   base64.b64decode(s, altchars=None, validate=False)
>
> But I don't see such an option in coreutils' base64. Can this option
> be added? Thanks.

  # use %* instead of +/:
  base64 whatever | tr '+/' '%*'
Re: RFC: du reports a 1.2PB file on a 1TB btrfs disk
On 2020-03-10 21:31, Jim Meyering wrote:
> On Tue, Mar 10, 2020 at 12:24 PM Kaz Kylheku (Coreutils)
> <962-396-1...@kylheku.com> wrote:
>> On 2020-03-10 11:52, Jim Meyering wrote:
>>> Otherwise, du provides no way of seeing how much of the actual
>>> disk space is being used by such FS-compressed files.
>>
>> If you stat the file, what are the values of st_size, st_blksize
>> and st_blocks?
>
> That particular file is long gone, but I've just created a 1.8T file
> on a 700G file system. Before I began this experiment, "Avail" was
> 524G, so it appears to occupy about 60G actual space.

Sorry; forget I mentioned st_blksize; I forgot that st_blocks is measured in 512 byte blocks regardless of st_blksize.

> FTR, I created the file by running this:
>
>   yes $(printf '%065535d\n' 0) > big
>
>   $ stat big
>     File: big
>     Size: 1957123607586  Blocks: 3822507048  IO Block: 4096  regular file

So here, the Blocks value (coming from st_blocks) doesn't inform us differently from size; if we multiply it by 512, it essentially matches the size, rounded up to whole blocks.

The underlying FS can use the st_blocks value to indicate the actual storage. For instance, if I do this on ext4:

  # dd of=file seek=$((1024 * 1024)) count=1 if=/dev/zero

Then:

  # du -h file
  12K     file
  # du --apparent-size -h file
  513M    file

The 12K figure comes from the st_blocks information in the stat structure:

  # stat file
    File: `file'
    Size: 536871424  Blocks: 24  IO Block: 4096  regular file
  Device: 902h/2306d  Inode: 1624448  Links: 1
  Access: (0644/-rw-r--r--)  Uid: (0/root)  Gid: (0/root)
  Access: 2020-03-11 04:22:26.0 -0700
  Modify: 2020-03-11 04:22:26.0 -0700
  Change: 2020-03-11 04:22:26.0 -0700

The issue you are seeing here is that btrfs should probably be publishing a st_blocks value that matches the actual storage, accounting for sparseness and compression, and not just a repetition of the size, rounded up to a block and quoted in 512 byte units. The fidelity of the du output is only as good as what is in stat.
Re: RFC: du reports a 1.2PB file on a 1TB btrfs disk
On 2020-03-10 11:52, Jim Meyering wrote: Otherwise, du provides no way of seeing how much of the actual disk space is being used by such FS-compressed files. If you stat the file, what are the values of st_size, st_blksize and st_blocks?
Re: ls feature request
On 2020-02-21 10:32, Riccardo Mazzarini wrote:
> Hi Kaz, this works almost perfectly but it fails with filenames that
> contain spaces. I tried using quotation marks, i.e.
>
>   ls -dU "$(find .* * -maxdepth 0 -not -type d | sort ; find .* * -maxdepth 0 -type d | sort)"
>
> but that didn't work. Any ideas?

I can answer that in three parts of increasing complexity. The remaining caveat is that since we are relying on passing all names as arguments to a single invocation of "ls", these solutions are all susceptible to the kernel's argument passing limit.

Part 1:

Solutions involving capturing the output of a program and interpolating it as arguments for ls will not work. Or if they are made to work, they will require a clumsy escaping-and-eval job. So we switch to another method. If the only issues with names are spaces and control characters, but no spurious newlines, so that the output of "find" has exactly one name per line, then we can use xargs:

  (find .* * -maxdepth 0 -not -type d | sort ;
   find .* * -maxdepth 0 -type d | sort) | xargs ls -dU

Note that xargs cannot use your shell alias for ls. If you want colors, you have to add --color=auto.

Part 2:

If the names can be completely arbitrary strings, and include newlines, then we have "find -print0" that will output names as null terminated strings, and we have "xargs -0" that reads null-terminated strings. What we don't have is a "sort" that does null-terminated string I/O.

But, what we do have is GNU Awk. GNU Awk can separate input according to arbitrary records, using a regular expression. In GNU Awk's regular expression syntax, we can specify the null byte as \0. Watch this. Here is a little test directory with some files:

  ~/test $ ls
  cert.pem  char.c  hello.c   Makefile      palin.tl  str.sh
  char      hello   lex.awk   notreached.c  pushl.s

We can pass these as null-terminated strings with "find -print0" to a gawk script which handles them just fine and prints them as newline-terminated lines:

  $ find . -print0 | gawk -v 'RS=\0' 1
  .
  ./hello
  ./Makefile
  ./lex.awk
  ./palin.tl
  ./char
  ./char.c
  ./hello.c
  ./cert.pem
  ./notreached.c
  ./pushl.s

And with -v 'ORS=\0', it will output null terminated records too! But we won't be making use of this. With the above, we can implement a sort easily:

  # Null terminated string sort using GNU Awk
  gawk -v 'RS=\0' '{ line[NR] = $0 }
                   END { asort(line);
                         for (l in line) {
                           printf("%s\0", line[l]);
                         } }'

It's quite a mouthful, so let's move the RS assignment into a BEGIN block and put the whole awk script into a variable called sort0:

  sort0='BEGIN { RS = "\0" }
         { line[NR] = $0 }
         END { asort(line);
               for (l in line) {
                 printf("%s\0", line[l]);
               } }'

With that variable, we can now have:

  (find .* * -maxdepth 0 -not -type d -print0 | gawk "$sort0" ;
   find .* * -maxdepth 0 -type d -print0 | gawk "$sort0" ) | xargs -0 ls -dU

Part 3:

Since we're using Gawk, we could run a single "find" job and use logic inside the Gawk script to do the separation of directories and non-directories. To distinguish the two, we can use GNU find's -printf instead of -print0. We can print directory names with a "d" prefix, and other entries with a "-" prefix.
My attempt at this script looks like this:

  #!/bin/bash

  (find .* * -maxdepth 0 \
     \( -not -type d -printf "-%p\0" \) -o \
     \( -type d -printf "d%p\0" \) ) | \
  gawk 'BEGIN { RS = "\0" }
        /^-/ { nondir[NR] = substr($0, 2) }
        /^d/ { dir[NR] = substr($0, 2) }
        END {
          asort(nondir)
          asort(dir)
          for (l in nondir)
            printf("%s\0", nondir[l]);
          for (l in dir)
            printf("%s\0", dir[l]);
        }' | \
  xargs -0 ls -dU --color=auto

As we want, the script handles the case when I have a file created using:

  $ touch 'foo
  bar'

It ends up displayed as 'foo'$'\n''bar', indicating that it got passed correctly through the plumbing all the way to the final ls -dU.
Re: ls feature request
On 2020-02-20 16:01, Riccardo Mazzarini wrote: The ls programs currently provides a "--group-directories-first" option, to group directories before files. I'd be nice to have the opposite option, "--group-directories-last" or "--group-files-first", to group files before directories. Workaround applicable in "ls -l" case: ls -l | awk '/^d/ { dirs = dirs $0 "\n"; next } 1 ; END { printf("%s", dirs); }'
Re: BUG in sort --numeric-sort --unique
On 2020-02-13 14:00, Stefano Pederzani wrote:
> In fact, separating the parameters:
>
>   # cat controllareARCHIVIO_2020/02/controllare20200213.txt | sort -u | sort -n | wc -l
>   1262
>
> we workaround the bug.

My own experiment confirms things to be reasonable. When -n and -u are combined, then uniqueness is based on numeric equivalence. Since numeric equivalence is weaker, de-duplication based on numeric equivalence can cull out more records than de-duplication based on textual equivalence.

  $ printf "0\n00\n000\n" | sort -u
  0
  00
  000
  $ printf "0\n00\n000\n" | sort -n
  0
  00
  000
  $ printf "0\n00\n000\n" | sort -nu
  0
  $ printf "0\n00\n000\n" | sort -n | sort -u
  0
  00
  000
  $ printf "0\n00\n000\n" | sort -u | sort -n
  0
  00
  000

As you can see, sort -nu is not equivalent to any combination of sort -n and sort -u. sort -nu has de-duplicated a file of different "spellings" of zero down to a single entry. sort -u may not de-duplicate these entries because "0" is textually different from "00".

> Every line is only something like "1.2.3.4".

Unfortunately, "sort -n" will probably not do what you think with this data. Please read sort's GNU Info documentation; the man page lacks detail about what numeric sorting means. Also, the POSIX standard's description of -n:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

In short, what -n does is recognize a *prefix* of each line as a number according to a pattern that includes optional blanks, an optional sign, digits, a radix character, and digit group separators. -n does not deal with compound numeric identifiers like 1.2.3.4. Basically 1.2.3.4 and 1.2.4.4 both look like the number 1.2.

  $ sort -nu
  1.2.3.4
  1.2.4.4
  1.2.5.6
  [Ctrl-D][Enter]
  1.2.3.4

Oops! This result is correct; under numeric sort (-n), all these lines are considered to have the key 1.2. And if we de-duplicate based on that, they are all considered to be duplicates; they de-duplicate down to a single line.
Re: gcc10's -Wreturn-local-addr gives FP warning about lib/careadlinkat
On 2020-02-06 09:05, Jim Meyering wrote:
> On Thu, Feb 6, 2020 at 6:03 AM Pádraig Brady wrote:
>> On 06/02/2020 00:27, Jim Meyering wrote:
>>> Building latest coreutils using latest-from-git gcc10 evokes this
>>> false positive:
>>>
>>>   lib/careadlinkat.c: In function 'careadlinkat':
>>>   cc1: error: function may return address of local variable
>>>   [-Werror=return-local-addr]
>>>   lib/careadlinkat.c:73:8: note: declared here
>>>      73 |   char stack_buf[1024];
>>>
>>> I'm guessing improved flow analysis will eventually suppress this.
>>> I hesitate to turn off the useful and normally-high-S/N
>>> -Wreturn-local-addr globally. Maybe just disable it in that one
>>> file, temporarily?
>>
>> The logic of the function looks fine. Would an
>> `assure (buf != stack_buf)` before the `return buf` indicate that
>> constraint to gcc with minimal runtime overhead?
>
> I would have preferred that, but it has no effect. I then tried to
> suppress that warning in the affected file by adding these lines:
>
>   /* Without this pragma, gcc 10.0.1 20200205 reports that the
>      "function may return address of local variable".  */
>   # pragma GCC diagnostic ignored "-Wreturn-local-addr"
>
> But, surprisingly, that didn't help, either. Also tried Kaz
> Kylheku's return-early suggestion, to no avail.

I have other thoughts about this.

There is a well-known technique for using a stack array for small objects up to a certain size, after which a dynamic array is used. That technique is useful in cases when dynamic allocation is avoided entirely. If the array has to eventually go into dynamic storage, because it is returned, then it can just start out that way.

So that is to say, this justification for the stack_buf is pretty poor:

  /* Allocate the initial buffer on the stack.  This way, in the
     common case of a symlink of small size, we get away with a
     single small malloc() instead of a big malloc() followed by a
     shrinking realloc().  */

The common case is in fact small symlinks; I can't remember when I've seen a symlink that was anywhere near a kilobyte long. If you start with, say, a 128 byte malloc buffer, or even a 64 byte one, there is hardly any need to realloc that to a smaller size, and doing so for chunks of that size might not even actually make any memory available, depending on the allocator. The vast majority of symlinks you will ever read will fit into 128 bytes.

Also think about this: depending on the exact filesystem, small symlinks are stored directly in the inode (or perhaps even directory entry?) whereas large symlinks have to go to a separate block. So, okay, there is an overhead *inside* readlink for fetching a large symlink, and that overhead dwarfs the user-space concern of whether an extra realloc is called. readlink may have to read a whole extra data block of storage containing the symlink, on a cache-cold system. That could result in a disk seek, bloating up the time into a range measured in milliseconds.

Basically, the initial guesstimate of the required space for a symlink should probably be more or less aligned with a reasonable estimate of the symlink size that is efficiently handled at the filesystem level.
Re: gcc10's -Wreturn-local-addr gives FP warning about lib/careadlinkat
On 2020-02-05 16:27, Jim Meyering wrote:
> Building latest coreutils using latest-from-git gcc10 evokes this
> false positive:
>
>   lib/careadlinkat.c: In function 'careadlinkat':
>   cc1: error: function may return address of local variable
>   [-Werror=return-local-addr]
>   lib/careadlinkat.c:73:8: note: declared here
>      73 |   char stack_buf[1024];
>
> I'm guessing improved flow analysis will eventually suppress this.

By chance, does this make it go away (my changes in #else parts of #ifdef)?

  if (link_size < buf_size)
    {
      buf[link_size++] = '\0';

      if (buf == stack_buf)
        {
          char *b = (char *) alloc->allocate (link_size);
          buf_size = link_size;
          if (! b)
            break;
          memcpy (b, buf, link_size);
  #ifdef OLD
          buf = b;
  #else
          return b;
  #endif
        }
      else if (link_size < buf_size && buf != buffer && alloc->reallocate)
        {
          /* Shrink BUF before returning it. */
          char *b = (char *) alloc->reallocate (buf, link_size);
  #ifdef OLD
          if (b)
            buf = b;
  #else
          if (b)
            return b;
  #endif
        }

      return buf;
    }
Re: What is the difference between unlink and rm -f?
On 2020-01-29 01:45, Peng Yu wrote:
Hi, it seems to me unlink and rm -f are the same if the goal is to delete files. When are they different? Thanks.

I answered this on Unix Stackexchange in 2016: https://unix.stackexchange.com/a/326711/16369 :)
Re: Is it safe to replace dd?
On 2020-01-20 04:14, microsoft gaofei wrote:
Many people suggest using dd to create bootable USB, https://www.archlinux.org/download/ . But cp and cat also write to USB, e.g., cp archlinux-2020.01.01-x86_64.iso /dev/sdb, cat archlinux-2020.01.01-x86_64.iso > /dev/sdb. Is it safe to use these commands instead of dd? If it's unsafe, I want to know the reason.

dd was required on ancient Unix systems for dealing with "raw" devices that had mandatory block sizes. For instance, if a raw device such as a hard disk or tape drive had a block size of 512, then writing to it required a sequence of correctly sized write system calls. If the program wrote more than 512 bytes, the device driver would truncate the write to 512. If the program wrote fewer than 512 bytes, then it wouldn't completely overwrite the block, yet the position would advance to the next block. Maybe garbage would be left in the partial block, or zeros. With reads there would be a similar problem. A 256 byte read on a raw device with a 512 block size would result in a truncated read (very reminiscent of a truncated UDP datagram receive). The dd program's block size feature would ensure that reads and writes involving raw devices were performed correctly. With dd you can read from a raw device with 256 byte blocks, and output to a device with 1024 byte blocks, an operation called "re-blocking".

The block devices you're working with in a GNU/Linux system aren't raw. You can write to them in whatever request sizes you want. The aggregation into correct transfer units is done by the block driver software inside the kernel. There is a small advantage in writing a multiple of the block size. For instance, suppose we write to a block device like /dev/sda1 one byte at a time. Each time we write a byte, an entire block is edited in-memory to change that byte, and then the entire block is flushed out to the device, usually asynchronously. By writing bytes, we risk reduced performance: the same block of the device may be wastefully dirtied and flushed two or more times. However, the buffer sizes used by standard utilities like "cp" are almost certainly good multiples of a block size. Block sizes are almost always powers of two, and so are the buffers in file copying utilities, which are also larger than typical block sizes.

dd has features that are not found in other utilities, such as seeking into arbitrary positions in the source and destination and copying only certain amounts. dd can also work with devices that are infinite sources of bytes; with dd you can read 1024 bytes from /dev/urandom, which can't be done with cat or cp. If you need to do any of these things, you need dd, or something like it.
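A couple of sketches of those dd-only jobs (/dev/sdX is a placeholder device name):

  # read exactly 1024 bytes from an endless byte source
  dd if=/dev/urandom of=key.bin bs=1024 count=1

  # copy 4 MiB out of a device, starting 1 MiB into it
  # (skip= offsets into the input; seek= would offset into the output)
  dd if=/dev/sdX of=chunk.bin bs=1M skip=1 count=4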
Re: Regarding compilation of coreutils.
On 2020-01-06 11:53, Sandeep Kumar Sah wrote:
previously i edited ls.c to print "Hello World" before listing content in a directory. Now i have deleted the coreutils folder and everything underneath it. I want to get the original version of ls command for which i am unable to build the source file, it tells me that

  checking for a BSD-compatible install... /usr/local/bin/install -c
  checking whether build environment is sane... configure: error: ls -t appears to fail.
  Make sure there is not a broken alias in your environment
  configure: error: newly created file is older than distributed files!

Did you install this modified ls into your /bin? Or is it in some non-system location that happens to be listed in your PATH? If you didn't clobber your system ls, so this is just a PATH issue, either edit PATH, or find out where this modified ls is and remove/rename it. If you clobbered your /bin/ls, you may be able to use your GNU/Linux distro's packaging system to refresh the installation.

Assuming you added something like:

  printf("Hello, World\n");

to the code, then you can edit /bin/ls with a binary editor, such as, oh, "vim -b /bin/ls". Find the "Hello World" string, and overwrite the "H" with a null byte to reduce it to zero length. Save the executable and try it. If it's something like

  puts("Hello, World");

where the newline is implicit in the function behavior, you may have to find the instructions which make this call and overwrite them with NOP (byte value 0x90 on Intel x86, IIRC).

Other ideas/hacks:

- Copy a working /bin/ls from another system that is identical or similar to yours. E.g. say you're on 64 bit Ubuntu 18. If you happen to have 64 bit Ubuntu 16, that system's /bin/ls should work.

- Go into the Coreutils configure system and try to defeat the test for a working "ls -t". Maybe the result of the test is not needed for the sake of building a working ls.

- Rename the funny ls binary to ls-funny, and write a /bin/ls shell script wrapper which calls ls-funny "$@", and filters out the Hello, World first line of output, as in something like:

  #!/bin/sh
  /bin/ls-funny "$@" | sed -n -e '2,$p'

- Absolute last resort of the utter coward: Boot some rescue DVD-ROM. Mount your install partition and copy the live system's /bin/ls into your install partition's /bin/ls.
Re: [PATCH] sleep: allow ms suffix for milliseconds
On 2019-12-08 21:46, sunnycemet...@gmail.com wrote:
On 2019-12-02 13:58, Stephane Chazelas wrote:
With GNU coreutils sleep (and ksh93's builtin but not that of bash or mksh) one can add an e-3 suffix to get milliseconds (and e-6 for us and e-9 for ns):

  sleep 1        # s
  sleep 1000e-3  # ms
  sleep 100e-6   # us
  sleep 10e-9    # ns

Thank you for the trick (and Berny for the documentation patch). It's new to me, but I guess that's what I get for not investigating the info page's notes.

Though it's a nice trick, it obviously depends on the value not having an E exponent already. When that can be assumed or assured, it's useful, no doubt.
Re: [PATCH] sleep: allow ms suffix for milliseconds
On 2019-11-29 09:38, Bernhard Voelker wrote:
On 2019-11-29 14:30, Rasmus Villemoes wrote:
When one wants to sleep for some number of milliseconds, one can do (at least in bash)

  sleep $(printf '%d.%03d' $((x/1000)) $((x%1000)))

but that's a bit cumbersome.

Why not use floating-point numbers directly?

  $ sleep 0.01234

I think the point is that the above example is doing exactly that, but it has to convert from a value x which is in milliseconds. The shell has only integer arithmetic, so a clumsy expression is required. If the shell had floating arithmetic, it would just be this:

  sleep $((x / 1000))

With GNU dc we can do:

  sleep $(dc -e "3k $x 1000/p")

Calling sleep(1) with a small milliseconds argument seems anyway a very rough hammer, because the overhead to launch the executable is larger than the actual nanosleep(2) system call.

Well, nobody says that the x value is in the range [0, 1000). Sleeping for 15500 milliseconds is valid. But in any case, we can already do that with "sleep 15.500". The issue is that it's cumbersome to convert from 15500 to 15.500 in a shell script. That's the problem to fix. Next time someone will need another such conversion in another context, and then yet another; we can't be adding units suffixes into every utility that takes numeric arguments. Fix the right problem in the right place. That goes especially for issues that aren't blockers; there is no urgency to address this problem with a quick fix like "sleep 123m" because the cumbersome shell code works fine.
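If the conversion keeps coming up in a script, it can be wrapped once there, instead of being added to every utility that takes a numeric argument. A sketch (the ms_sleep name is made up; fractional sleep arguments are a GNU extension):

  # sleep for the number of milliseconds given as $1
  ms_sleep() {
      sleep "$(printf '%d.%03d' $(($1 / 1000)) $(($1 % 1000)))"
  }

  ms_sleep 15500   # sleeps for 15.5 seconds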
Re: [PATCH] sleep: allow ms suffix for milliseconds
On 2019-11-29 05:30, Rasmus Villemoes wrote:
* src/sleep.c: Accept ms suffix.
* doc/coreutils.texi (sleep invocation): Document it.

When one wants to sleep for some number of milliseconds, one can do (at least in bash)

  sleep $(printf '%d.%03d' $((x/1000)) $((x%1000)))

but that's a bit cumbersome. Extend sleep(1) to also accept "ms" as a suffix, so one can instead do

  sleep ${x}ms

Could be worth it to accept a few other suffixes like "us" and "ns".
Re: question about SI/IEC in df
On 2019-11-28 10:16, Kaz Kylheku (Coreutils) wrote:
But, let me remark, using KB, MB, for powers of 1000 is neither metric, nor grounded in tradition. If it's all caps like KB and MB, it's clearly 1024-based just like without the B. There has to be a lower-case b, and proper casing of the scale: k M g t p.

Sorry about that, this is flatly wrong; lower case b is bits: kb is a kilobit.
Re: question about SI/IEC in df
On 2019-11-28 04:39, Krzysztof Labus wrote:
In the manual I see: The SIZE argument is an integer and optional unit (example: 10K is 10*1024). Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000). 1. Why is df not using Ki, Mi, Gi etc. for powers of 1024?

- Wastes space.
- Flouts tradition.
- Scripts in the wild depend on the details of utility output; don't mess with it.
- It's ultimately "bike shedding".

But, let me remark, using KB, MB, for powers of 1000 is neither metric, nor grounded in tradition. If it's all caps like KB and MB, it's clearly 1024-based just like without the B. There has to be a lower-case b, and proper casing of the scale: k M g t p. M is capitalized because m stands for milli, but the b won't be capitalized, hence Mb.

References: https://en.wikipedia.org/wiki/Kilobyte

"The internationally recommended unit symbol for the kilobyte is kB."

"In some areas of information technology, particularly in reference to digital memory capacity, kilobyte instead denotes 1024 (2^10) bytes. This arises from the powers-of-two sizing common to memory circuit design. In this context, the symbols K and KB are often used."
Re: feature request du/find
On 2019-10-30 13:14, Benjamin Arnold wrote:
Hi, thanks a lot for your quick response. Sorry, i must have missed the -links option, that's exactly what i am looking for.

Unfortunately, it's a component in an incorrect solution. A file tree being backed up can contain hard links (e.g. two executables in /usr/bin being the same file). The general condition we must look for is that if the tree has N directory entries pointing to the same object O, then O is entirely contained in that tree if its link count is N. Otherwise the count must be > N, and there are links to it elsewhere. A static -links predicate in find will not do it.

In my case du would have counted "twice", because the other hard link is not in the directory du is searching in.

I think this should be a feature of du; and likely du has most of the pieces in place to make this easy. It already identifies multiply linked objects. All it needs is a flag which will cause it to disregard the size of any object which has more links than the number of times du has encountered that object.

The obvious algorithm will have an effect on the order in which du reports objects. When the option is in effect, du will show the path which references the *last* occurrence of the object (in the traversal order). E.g. if some object with link count = 3 is processed, the first two appearances of it won't be reported and counted, but when the third one is seen, du can be sure that there are no other references and can then tally the object's size and report on it. This algorithm will naturally cull the objects with excessive link counts: the condition for reporting them and adding them to the total simply isn't reached. A userland sketch of the tallying logic appears below.
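A sketch of that tallying logic in userland (assumes GNU find; regular files only, with sizes taken as 512-byte blocks, and $tree being the directory in question):

  find "$tree" -type f -printf '%n %i %b\n' |
  awk '{ seen[$2]++                  # count occurrences of each inode
         if (seen[$2] == $1)         # the last link lies within the tree
             total += $3 }
       END { print total * 512, "bytes" }'

An object whose extra links live outside the tree never reaches seen == links, so it is culled from the total, exactly as described.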
Re: How to implement the V comparsion used by sort in python?
On 2019-10-26 16:05, Peng Yu wrote:
Are you sure they are 100% compatible with V? I don’t want to use them just to later find they are not 100% compatible.

"Are you sure various Python packages are compatible with something vaguely described in some years-old obscure blog post" doesn't seem like a great question for the Coreutils mailing list. Try a Python forum?
Re: Does head util cause SIGPIPE?
On 2019-10-25 00:56, Ray Satiro wrote:
Recently I tracked down a delay in some scripts to this line:

  find / -name filename* 2>/dev/null | head -n 1

(Here 'filename*' should be quoted, because we want find to process the pattern, not for the shell to try to expand it.)

Interestingly, POSIX neglects to say whether head is required to consume the rest of its input after dumping the required number of lines, or whether it may terminate immediately (thereby abruptly closing its standard input, possibly causing a broken pipe error in the upstream process). (Of course head can be given two or more file arguments, in which case of course it doesn't quit until the last one is processed.) In a RATIONALE paragraph, though, POSIX says that head, for a single file, could be simulated using "sed 10q"; and that *will* quit immediately and break the pipe.

It appeared that the searching continued after the first result was found. I expected head would terminate and SIGPIPE would be sent which would cause find to terminate, but that did not happen. Since I was in cygwin I thought maybe it was a Windows issue but I tried in Ubuntu 16

SIGPIPE not working in Cygwin cannot possibly be a "Windows issue", since Windows has no such thing as SIGPIPE; it would be a Cygwin issue.

with an arbitrary file and noticed the same thing.

I don't see it in Ubuntu 16 or 18 at all. "find / | head" shows ten lines, and after that, there is no evidence of any find process continuing to execute. If I run:

  find / | head && ps -aux | grep find

the grep process only finds itself in the output of ps; I tried about 20 times in a row, hoping to catch a briefly lingering find, but nothing.
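In bash, PIPESTATUS gives a more direct verdict on how find ended (a sketch; 141 is 128 + SIGPIPE, which is signal 13 on GNU/Linux):

  $ find / -xdev 2>/dev/null | head -n 1 > /dev/null
  $ echo "${PIPESTATUS[@]}"
  141 0

On a GNU/Linux box this typically prints "141 0", confirming that find died of the broken pipe.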
Re: Compile Coreutils without xattr but i installed
On 2019-10-10 11:56, Wei MA wrote:
I compile the source code. And when i ran tests/cp/capability.sh, cp preserves attr failed without xattr support. Then i installed xattr. I deleted coreutils and downloaded it again. The problem still exists.

A configure problem likely won't be due to a bad copy of coreutils; you have to debug where that is going wrong: why it isn't detecting the xattr. It looks like the Coreutils configure script checks for a couple of libattr headers. It also looks for a function attr_copy_file in libattr. See:

  http://git.savannah.gnu.org/cgit/coreutils.git/tree/m4/xattr.m4

I use Ubuntu 18. When i ran cp of Ubuntu, the same commands have no problem.

A possibility may be to find the Ubuntu 18 build recipe for coreutils and find out what it's doing differently from you. Does it pass something to the configure script or force any Autoconf variables (ac_cv_whatevers)? Does it apply any patches, etc.?
Re: md5sum and recursive traversal of dirs
On 2019-10-10 10:29, Сергей Кузнецов wrote:
Hello, I find it strange that md5sum does not yet support recursive directory traversal. I moved part of the code from ls and added this functionality. How about adding this? I also added sha3 and md6 algorithms, they are in "gl/lib/".

If we have any utility whatsoever that operates on files, sometimes we want to apply it to every file in a tree. It does not follow that every utility whatsoever that operates on files should integrate the code for traversing a tree. We have ways in the shell, and in other programming languages, to map any operation over a tree of files. The mapping mechanism maps, the MD5 mechanism calculates MD5 sums; each has a single responsibility.

One noteworthy tree traversal mechanism appears as an extension in the GNU Bourne-Again Shell (Bash). In Bash, if you set the "globstar" option like this:

  shopt -s globstar

then the ** operator becomes active in file globbing patterns. The ** operator spans across multiple path components. For instance:

  # calculate the md5sums of all .so files anywhere in /usr/lib
  md5sum /usr/lib/**/*.so

By the way, I wrote two new small programs: xchg (or swap, which name is better?) and exst (exit status).

Exchanging two files can be implemented as a shell function, which can be extremely simple if we don't worry about exchanging files in different filesystem volumes. Here is a sketch:

  swap() {
    local tmpname=$(mktemp swap-XXXXXX)
    # ... check arguments here for count and sanity ...
    mv -- "$1" "$tmpname"
    mv -- "$2" "$1"
    mv -- "$tmpname" "$2"
  }

The first program simply changes files, the number of which can be more than two.

That should probably be called "rotate", like the rotatef operator in Common Lisp. The logic becomes something like (untested):

  # if we have at least two arguments:
  if [ $# -gt 1 ] ; then
    mv -- "$1" "$tmpname"
    # while we have two or more arguments
    while [ $# -gt 1 ] ; do
      mv -- "$2" "$1"
      shift
    done
    # last argument gets $tmpname
    mv -- "$tmpname" "$1"
  fi

Example: rotate some logs:

  rotate deleteme log.2 log.1 log.0 log
  rm deleteme

Undoubtedly elegant and useful; but should it be a C program in GNU Coreutils? Hardly.

The second program launches the program indicated at startup and, after its completion, prints the output status or the caught signal.

Doable in shell scripting, again. The status of the last command is available in the $? variable. This can be tested:

  stat=$?
  if [ $(( stat & 0x80 )) != 0 ] ; then
    printf "terminated due to signal %d\n" $((stat & 0x7F))
  else
    printf "exited with status %d\n" $stat
  fi

Bash has "massaged" the value already. That is to say, if the program terminates normally with an unsuccessful status 19, we don't have to do any shifting to recover the value from the upper bits of an exit status word; $? simply holds the value 19.
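For shells without globstar, find can provide the same mapping portably (a sketch):

  # calculate the md5sums of all .so files anywhere in /usr/lib
  find /usr/lib -name '*.so' -exec md5sum {} +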
Re: Can natural sort support be added?
On 2019-10-08 00:47, Peng Yu wrote:
Hi, since natural sort is provided in a few languages (as mentioned in the Wikipedia page), can it be supported by `sort` besides just version-sort? https://en.wikipedia.org/wiki/Natural_sort_order

This page has no precise definition of natural sort order. Its external references have poor credibility, consisting of a blog entry at "codinghorror.com" and links to some implementations of something in Perl, PHP, Python and Matlab. A link to some IETF RFC or ISO standard or other major document is required.
Re: [PATCH] chown: prevent multiple --from options
On 2019-09-29 02:46, Francois wrote:
We can fix by rejecting the cases where the --from option is provided multiple times and uid or gid are set twice.

A more sophisticated fix is to allow the --from to be given multiple times, but have the resulting range be the intersection of all of the ranges given. Each successive --from applies clipping to the range calculated so far. If a uid or gid are given twice, but match, that should be fine too; why not.
Re: Wishing rmdir had a prompt
On 2019-09-02 01:03, Sami Kerola wrote: I am not a maintainer, but I don't see any problem adding --interactive long only option. Getting a short option may clash with future posix requirement, so I believe they are not handed out without really good reasons. Fear not; POSIX standardization is not ignorant of significant implementations like those from GNU. For example, here is a literal quote from Issue 7 (2018) POSIX's awk page: "The undefined behavior resulting from NULs in extended regular expressions allows future extensions for the GNU gawk program to process binary data." https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html "GNU -> GNU Now (mentioned in) Unix" :)
Re: Wishing rmdir had a prompt
On 2019-09-01 17:50, Leslie S Satenstein via GNU coreutils General Discussion wrote:
rmdir -i

I don't see this in a fairly recent GNU Coreutils 8.28 installation. Must be very new?

There is some justification for such a thing. Though it may seem that accidental deletion of empty directories is easy to restore by recreating them, that does not put the file system back in the pre-deleted state. The modification timestamps of parent directories are tweaked, and new empty directories come up with their own modification timestamps as well as different inode numbers, permissions, group and user ownerships and such. There is no substitute for not deleting something; what comes close is recovering it from a backup.
Re: /bin/echo -- $var
On 2019-08-15 00:53, Harald Dunkel wrote:
IMHO they should have kept the "no args allowed" for echo ("in the late 70s") and should have introduced a new tool "eecho" instead.

Well, a new tool for printing was introduced, under the name "printf".
Re: /bin/echo -- $var
On 2019-08-14 05:01, Harald Dunkel wrote: Hi folks, I just learned by accident that var="-n" /bin/echo -- $var actually prints -- -n Shouldn't it be just -n ? According to POSIX, echo doesn't take options. It is specified that "Implementations shall not support any options." (We have options, though, so things are complicated.) Furthermore, the specification explicitly speaks of -- thusly: "The echo utility shall not recognize the "--" argument in the manner specified by Guideline 10 of XBD Utility Syntax Guidelines; "--" shall be recognized as a string operand."
Re: How to convert a md5sum back to a timestamp?
On 2019-07-31 20:36, Peng Yu wrote:
Hi, suppose that I know a md5sum that is derived from one of the timestamps computed below. Is there a way to quickly derive what the original timestamp is? I could make a database of all the timestamps and their md5sums. But as the total number of entries increases, this solution will not be scalable as the database can be big. Is there any better solution to this problem?

  for i in {1..2563200}; do date -d "-$i minutes" +%Y%m%d_%I%M%p; done

The solution to this is to back up several levels in whatever you are working on, and restructure the approach to the real problem in such a way that the flowchart box which says "And here we just crack MD5 sums" is somehow eliminated.
Re: Possible ls bug?
On 2019-02-26 13:10, Bartells, Paul wrote:
I have encountered behavior with ls -R that appears to be incongruous. My actual command line entry is: ls -alR /kyc_mis/dev/*/*/paul/* > pauldev.lst. [ ... ]

  /kyc_mis/dev/rpts/paul/kyc:
  total 599
  -rwxrwx--- 1 pb82477 kycmis 262144 Oct 31 17:06 kyc_excepreport_old_20181022.sas7bdat
  drwxrwx--- 10 pb82477 kycmis 176 Oct 24 16:14 ..
  drwxrwx--- 2 pb82477 kycmis 55 Oct 31 17:06 .

What is odd here is that the name of the directory being listed doesn't match the pattern /kyc_mis/dev/*/*/paul/ (assuming we can trust that line of the output to be true: ls is really listing a directory by that name). It has a "kyc" component where the pattern expects "paul". You haven't told "ls" to follow symlinks, either. Unless the filesystem has cycles, or ls has started following ".." links, we should not be ending up in this directory from any starting point that matches the command line pattern.
Re: [PATCH] df: Adding a --no-headers option, by request of Bruce Dubbs
On 2019-03-17 05:27, Ed Neville wrote:
Taking suggestions into account, '--no-headers' seems more consistent with ps options.

This loses on character count:

  df --no-headers
  df | sed 1d

Fully golfed:

  df|sed 1d

Oops!
Re: FAQ confusing terminology regarding GNU and Linux Relationship
On 2018-10-16 18:58, fdvwc4+ekdk64wrie5d8rnqd9...@guerrillamail.com wrote:
Under the section in the FAQ about uname, it refers to ``the Linux kernel." Is not the GNU position that Linux should be referred to as ``Linux, the kernel'' or something similar?

The GNU position is that an operating system distribution consisting of GNU programs and libraries on top of Linux shouldn't just be called "Linux", because that's just the name of the kernel. The appositive expression "Linux, the kernel" is misleading, in fact, because it insinuates that Linux needs to be qualified in this manner, as if there were some relevant Linux that isn't a kernel. Linux, the kernel---as opposed to Linux, the what? It only makes sense in a sentence like "we're talking about Linux, the kernel, not Linux, the Swiss laundry detergent".
Re: RFC: rm --preserve-root=all to protect mount points
On 2018-06-10 23:14, Pádraig Brady wrote:
I was asked off list to consider adding an option to rm that could be enabled with an alias, and would protect mount points specified on the command line. [...]

  $ rm -r --preserve-root=all /dev/shm
  rm: skipping '/dev/shm', since it's a mount point
  rm: and --preserve-root=all is in effect

The command option is well-named, but consider changing "mount point" to "mount" in this diagnostic and, more importantly, any documentation which refers to this. E.g. "since a filesystem is mounted there", "since it is a filesystem root", etc.

I think the "mount point" terminology is misleading because one important sense of the word is that it refers to the Unix kludge of requiring an empty directory to exist for a mount. The empty directory where one intends to mount a filesystem is the "mount point" for it. This option cannot protect directories which are mount points in that sense; only ones that are carrying mounts.
Re: mv --recursive
On 2018-06-01 04:08, Grady Martin wrote:
Hello. I have two questions:
· Is there a way to recursively merge two directories with move (not copy/delete) operations, using standard GNU utilities?
· If not, how do coreutils' maintainers feel about an -r/-R/--recursive patch for mv?

We can almost do this already, with cp, except that the files also remain at the source. I.e. instead of the wished-for

  mv -R old new

we can do:

  cp -rl old/. new/.

The new/ tree is populated with hard links to corresponding objects in old, which is what mv will do (on the same filesystem, anyway). Basically, if cp had an option called "--remove-source", which does what its name says, I think it would do what you want. cp itself could optimize using hard linking when that option is specified, and the source and destination directories are on the same filesystem, which supports linking. cp with --remove-source would just about obsolesce mv.
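So, under the same-filesystem assumption, the wished-for merging move can be sketched today as:

  # hard-link old's contents into new (merging the trees),
  # then remove the source names
  cp -rl old/. new/. && rm -rf old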
Re: performance bug of `wc -m`
On 2018-05-20 16:43, Bruno Haible wrote: Kaz Kylheku wrote in https://lists.gnu.org/archive/html/coreutils/2018-05/msg00036.html : In what situation are there printable characters in the range [0, UCHAR_MAX) that have a width > 1? That's the wrong question. The question is which characters in this range have width > 1 or <= 0. The program below shows that the answer (on a glibc system) is: The character 0x00AD (= SOFT HYPHEN) is printable but has width == 0. I tried printing this on several terminals; all actually render something that is one character position wide. A program which calculates column positions on a terminal will be wrong if 0xAD has been printed, and it relies on this bogus datum from glibc.
Re: performance bug of `wc -m`
On 2018-05-13 15:05, Philip Rowlands wrote:
In the slow case, wc is spending most of its time in iswprint / wcwidth / iswspace. Perhaps wc could learn a faster method of counting utf-8 (https://stackoverflow.com/a/7298149); this may be worthwhile as the trend to utf-8 everywhere marches on. I can't explain without more digging why Python's string decode('utf-8') is better optimised for length calculations.

On the surface, it seems easy to explain: the Python program is just decoding UTF-8 and then taking the length. None of that requires character classification and determination of display width. If "wc -m" is doing something with display width, it's very different from what the Python is doing.

What are the requirements underpinning "wc -m", and how do these iswprint and iswspace functions fit into it? POSIX says this:

"The -c option stands for "character" count, even though it counts bytes. This stems from the sometimes erroneous historical view that bytes and characters are the same size. Due to international requirements, the -m option (reminiscent of "multi-byte") was added to obtain actual character counts."

I don't see how this amounts to having to call iswspace and all that. Nowhere does POSIX say that the display width of a character has to be obtained in "wc", and I don't see that in the GNU documentation either.
Re: performance bug of `wc -m`
On 2018-05-16 17:13, Eric Fischer wrote: I also found wcwidth to be a bad performance bottleneck in my multibyte branch of coreutils. To fix the problem in my branch, I added a cache of the widths returned for characters in the range from 0 to UCHAR_MAX (which perhaps should also be widened to include a few other common alphabets). The caching code is at the bottom of In what situation are there printable characters in the range [0, UCHAR_MAX) that have a width > 1? The lowest-numbered Unicode character that requires two spaces is U+1100, I think.
Re: is there a real escape "quoting" style for ls?
On 2018-05-13 09:30, Harald Dunkel wrote:
On 5/13/18 1:08 PM, L A Walsh wrote:
If you look under --quoting-style, you'll see:

  --quoting-style=WORD  use quoting style WORD for entry names:
                        literal, locale, shell, shell-always,
                        shell-escape, shell-escape-always, c, escape

I haven't verified, but it looks like one of the options with the word 'shell' in it might be more in line w/what you want...

Maybe you should.

  c                    "A Knight's Tale: Part 2"
  escape               A\ Knight's\ Tale:\ Part\ 2
  literal              A Knight's Tale: Part 2
  locale               'A Knight\'s Tale: Part 2'
  shell                "A Knight's Tale: Part 2"
  shell-always         "A Knight's Tale: Part 2"
  shell-escape         "A Knight's Tale: Part 2"
  shell-escape-always  "A Knight's Tale: Part 2"

bash command line completion gives me one of

  A\ Knight\'s\ Tale\:\ Part\ 2
  "A Knight's Tale: Part 2"
  'A Knight'\''s Tale: Part 2'

The colon character doesn't require escaping for the purposes of command line processing; the character has no special meaning in the shell syntax. (Of course there is a : command, but that's not via special treatment of the character.)

Bash's completion, however, assumes that unescaped colons are separators of PATH-like lists. If you have a file called foo:bar and you type

  echo foo:b[Tab]

it will not complete on it; it treats foo:bar as a PATH-like list of two independent items, and tries to complete on just the "b". You will have to type foo\:[Tab] to get the foo\:bar completion, or "foo:[Tab]

But that escape is not actually necessary for the processing of the command line. It makes no difference: the word foo\:bar produces the same argument as foo:bar.

Ever the burning question: what are you trying to do? How are you blocked from doing that by colons not being escaped in the output of ls? Are you trying to copy and paste a *partial* escaped filename from the output of ls and then Tab-completing on it? In that case, sure, this style will not do:

  $ A\ Knight's\ Tale:\ [Tab]

But this style will work:

  $ "A Knight's Tale: [Tab]
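For reference, a table like the one above can be generated with a loop (a sketch; assumes GNU ls and a file by that name in the current directory):

  for s in c escape literal locale shell shell-always \
           shell-escape shell-escape-always
  do
      printf '%-20s ' "$s"
      ls --quoting-style="$s" -- "A Knight's Tale: Part 2"
  done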
Re: Difference in binaries present in old and new versions of gnu tools
On 2018-05-02 23:23, Mathai, Eldho (Nokia - IN/Bangalore) wrote: After the make install we could see many binaries are missing in the latest when compared with our existing old version. Can you help me here to know why these binaries are missing and where can I get the latest versions of these missing binaries. Do you honestly think that "mysql_upgrade_shell" and "omniidlrun.py" are GNU core utilities???
Re: env: add -S option (split string for shebang lines in scripts)
On 2018-04-24 22:09, Pádraig Brady wrote:
I was thinking that the explanation of -S in usage() would say something like:

  -S, --split-string=S  process and split S into separate arguments
                        used to pass multiple arguments on shebang lines

One little problem with this FreeBSD design is that in spite of supporting variable substitution, it lacks the flexibility of specifying where among those arguments the script name is inserted!

Say we have a "#!/usr/bin/env -S lang ..." script called "foo.lang". Suppose we want it so that "lang -f foo.lang -x" is invoked, where -f is the option to the lang interpreter telling it to read the script file foo.lang, and -x is an option to the foo.lang script itself. It would be useful if the -S mechanism could indicate this insertion of foo.lang into the middle of the arguments.

The variable expansion mechanism could do it. Suppose that a certain reserved variable name like ${ENV_COMMAND} (not my actual suggestion) expands to the name of the last argument received by env. Furthermore, when env is asked to expand this variable one or more times, it makes a note of this and then sets a flag which suppresses the script name from appearing in its usual position at the end. Then this is possible:

  #!/usr/bin/env -S lang -f ${ENV_COMMAND} -x
Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr
On 2018-03-20 15:18, Assaf Gordon wrote:
Two things for later (not critical for now): to make review easier, it's recommended to combine all commits that relate to a single program into one commit. This is called "squash" in git (see: http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html

All little commits that achieve one logical change should be squashed. For instance, don't have

  239d4f9 foo: implement feature X.
  3df77ab foo: fix missing semicolon in new X.
  93df301 foo: fix null pointer deref due to X.

These little incremental fixes in the development of feature X shouldn't be published as separate changes; only the polished, debugged, reviewed "feature X". However, changes pertaining to different development "topics" should never be squashed into one commit. Two commits touching the same program in GNU Coreutils are not automatically on the same topic. E.g. these would be wrong to squash together:

  100df03 ls: implement quoting for whitespace.
  69d34d0 ls: fix bad indentation in several functions.

Review is certainly not easier when multiple changes are combined. I don't want to review some change in logic, under the distraction of numerous whitespace changes, or changes in unrelated logic.

https://blog.carbonfive.com/2017/08/28/always-squash-and-rebase-your-git-commits/ ).

That is just lunacy. Certainly, you should cleanly cherry pick or rebase all changes onto a single mainline without a crazy merge graph: that much of it is true. Squashing all changes is poor. "Patch bombs" (big changes that combine multiple topics in one diff) will not pass review in any shop that understands version control.
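Git can do the first kind of squashing without hand-editing the rebase todo list (a sketch; the hashes are the hypothetical ones from the example above):

  # commit each incremental fix as a fixup of the feature commit ...
  git commit --fixup=239d4f9

  # ... then fold all fixups into their targets in one pass
  git rebase -i --autosquash master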
Re: ls is broken, what's next cd?
On 2018-02-06 01:30, Michael wrote:
On 06/02/2018 08:13, Bernhard Voelker wrote:
On 02/06/2018 12:41 AM, Michael Felt wrote:
imho, the main problem is you change the default behavior, and 43 years of programs are broken.

no, as explained it does not affect programs and scripts, because this only changes the output to the terminal.

Yes, I thought about that too. So, maybe I would have liked the choice to be able to have them quoted IF and/or WHEN I needed to cut/paste names. But now I must either not install coreutils (as I have that option) or always remember to add three characters (' -N') every time I want normal ls output.

Are you saying that even names without spaces are being quoted? If you only see these quotes for idiotic file names, then there is really no issue. Nobody should even listen to your complaint, because it is prompted by the fact that you have such names in your filesystem, which automatically makes you wrong in the face of those same 43 years of Unix alluded to upthread.

I'm ideologically opposed to this -N thing myself, or anything which caters to these ill-conceived file names. However, practically speaking, sometimes professionals who do not themselves name things that way can fall "victim" to people who do. If you have to deal with someone else's filesystem or tarball or whatever, it does behoove you if your ls disambiguates things for you.
Re: ls is broken, what's next cd?
On 2018-02-05 07:18, Андрей Кинзлер wrote:
Hi, after upgrading my distro, I was quite surprised when I saw some of my files wrapped with single quotes. Why add such a useless feature of unnecessary garbage to the terminal output? Were you inspired by Microsoft when fixing something that wasn't broken, and now the files are totally misaligned when you type 'ls'?

+1

There is no need for this pointless garbage. Unix has gotten along without quoting the output of ls for 43 years now.

Programs which parse the output of "ls" are broken. They are also anachronistic; fixing the output of "ls" for their sake amounts to a solution that someone needed in 1987, not in 2018. Any halfway decent scripting language nowadays gives you some access to readdir; if not via its library, then via FFI. Plus utils for globbing, walking the file system and so on. Never mind halfway decent scripting languages; even in POSIX shell scripts, there is no need to read the output of ls.
Re: Why is `find -name '*.txt'` much slower than '*.txt' on glusterfs?
On 2018-01-19 20:26, Peng Yu wrote:
Hi, there are ~7000 .txt files in a directory on glusterfs. Here are the run times of the following two commands. Does anybody know why the find command is much slower than *.txt? Is there a way to change the API that `find` uses to search files so that it can be more friendly to glusterfs?

A wild guess: find is calling stat on every directory entry that it reads? What do you see if you run these commands under "strace"?

On GNU/Linux, programs that search through directories can avoid calling stat in many cases by taking advantage of the "d_type" field in "struct dirent". Maybe this doesn't work on glusterfs? The *.txt syntax (that specific case of it) doesn't have to stat any inodes, because it just cares about the names, not about whether they are directories or other objects.
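A quick way to test the wild guess (a sketch; the path is a placeholder, and strace's summary goes to stderr):

  # tally the system calls the command makes; look at the
  # stat-family counts (newfstatat on x86-64 GNU/Linux)
  strace -c -f find /mnt/gluster/dir -name '*.txt' > /dev/null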
Re: Would a patch adding color to cat(1) be accepted?
On 10.10.2017 07:03, Leslie S Satenstein wrote: My RESPONSE KISS Hey, why not? Next year, everyone's embedded system will have twice the flash. Then they can stop using BusyBox and switch to Coreutils with colorized everything!
Re: cp, ln, mv, install: check for vulnerable target directories
On 21.09.2017 11:03, Kaz Kylheku (Coreutils) wrote:
On 21.09.2017 09:18, Kaz Kylheku (Coreutils) wrote:
On 20.09.2017 18:59, Paul Eggert wrote:
Kaz Kylheku (Coreutils) wrote:
Instead of checking for what *could* go wrong, why not defend more specifically against signs that the attack might be actually happening.

That's what the patch is trying to do, though it looks like it should be improved.

There is a simple operating system fix for this: do not allow processes to create symlinks in directories to which they only have write access via S_IWOTH.

Two additional notes: Rather than a hard-coded behavior, this could be a "nolink" mount option, somewhat analogous to "nodev" (deny use of device nodes present in the filesystem). [...]

I completely missed the full value latent in this analogy. Just like "nodev" doesn't prevent creation of device nodes with mknod, but is aimed at curtailing their *use*, the proposed "nolink" mount option could similarly prevent the *traversal* of symlinks created in a shared directory, rather than blocking the means of their creation.

Suppose user mallory creates a symlink in a directory where multiple non-root users have write access. Then have it so that only mallory can follow the symlink (being the owner of the link). A symlink owned by mallory in a directory that is writable to alice shall not be dereferenceable by alice; there will be an EPERM when alice tries to traverse it. (If the filesystem is mounted "nolink".)

The permission denial would have to apply, of course, not only when a new symlink is created via the symlink system call, but also to the other cases enumerated in the quoted earlier message (moving or hard-linking an existing symlink into the directory, and overlaid directories). That takes care of the problem of trying to police various ways in which a symlink can sneak in. Let the symlink appear by any means, and just render it inoperable.

The criterion used by "nolink" could be that if a directory is being traversed and the next path component is a symbolic link found in that directory, then the traversal is allowed if either:

* the owner of the link and the owner of the directory are the same UID; or
* the owner of the link is the same as the effective UID of the caller.

This is applied regardless of the permissions on the directory. Otherwise the traversal is denied with EPERM. All other existing considerations for the use of the symlink continue to apply also.

Essentially, a directory in a "nolink" mounted FS has to "vouch for" a child symlink by having the same security attribute with regard to ownership, or else the symlink has to be the caller's own item.
Re: cp, ln, mv, install: check for vulnerable target directories
On 21.09.2017 09:18, Kaz Kylheku (Coreutils) wrote:
On 20.09.2017 18:59, Paul Eggert wrote:
Kaz Kylheku (Coreutils) wrote:
Instead of checking for what *could* go wrong, why not defend more specifically against signs that the attack might be actually happening.

That's what the patch is trying to do, though it looks like it should be improved.

There is a simple operating system fix for this: do not allow processes to create symlinks in directories to which they only have write access via S_IWOTH.

Two additional notes:

Rather than a hard-coded behavior, this could be a "nolink" mount option, somewhat analogous to "nodev" (deny use of device nodes present in the filesystem).

The permission denial would have to apply, of course, not only when a new symlink is created via the symlink system call, but also to:

* an attempt to move an existing symlink into a directory where the caller has write permission only via S_IWOTH (the rename system call has to check and enforce this);

* an attempt to duplicate a symlink into a directory via hard linking (the link system call has to check and enforce);

* any other situation: overlaid directories? (In consideration of whether a malicious symlink could be perpetrated in situations in which a shared directory is formed by overlaying via unionfs, overlayfs and their ilk, and the attacker is able to create symlinks in some of the underlying directories even though such an attempt is blocked in the assembled directory.)
Re: cp, ln, mv, install: check for vulnerable target directories
On 20.09.2017 18:59, Paul Eggert wrote:
Kaz Kylheku (Coreutils) wrote:
Instead of checking for what *could* go wrong, why not defend more specifically against signs that the attack might be actually happening.

That's what the patch is trying to do, though it looks like it should be improved.

There is a simple operating system fix for this: do not allow processes to create symlinks in directories to which they only have write access via S_IWOTH. More precisely, the proposal is that if a process wants to create a symlink, then it either has to be root, or else the owner of the directory with S_IWUSR asserted on the directory, or else the group owner (directly or via a supplementary GID) with S_IWGRP asserted. For the purposes of creating a symlink, the directory is treated as if S_IWOTH were false, even if it is set.

The main use case for shared writable directories is /tmp and "spool" directories. I can't think of a legit reason to be creating symlinks in those directories, only subdirectories (in which the creator then makes symlinks), regular files, and some special objects like sockets. A symlink in a shared writable directory is nothing more than a "name squatting" trap. Ergo, don't allow that. Or else, the responsibility for defense spreads all over the system, such as into basic utilities!
Re: cp, ln, mv, install: check for vulnerable target directories
On 19.09.2017 00:25, Paul Eggert wrote:
For years cp and friends have been subject to a symlink attack, in that seemingly-ordinary commands like 'cp a b' can overwrite arbitrary directories that the user has access to, if b's parent directory is world-writable and is not sticky and is manipulated by a malicious user.

Also, it occurs to me that the attack can be perpetrated if any of the ancestral directories are writable to another non-root user. Suppose we have

  cp passwd /alpha/beta/gamma/delta/omega

If the attacker can write to /alpha, the attacker can create a symlink at the end of a path like this:

  /home/attacker/beta/gamma/delta/omega ->

and, having write access to /alpha, the attacker can replace the /alpha/beta directory with this symlink:

  /alpha/beta -> /home/attacker/beta
Re: cp, ln, mv, install: check for vulnerable target directories
On 19.09.2017 00:25, Paul Eggert wrote:
For years cp and friends have been subject to a symlink attack, in that seemingly-ordinary commands like 'cp a b' can overwrite arbitrary directories that the user has access to, if b's parent directory is world-writable and is not sticky and is manipulated by a malicious user.

From patch:

PE> +environment variable.)  For example, if @file{/tmp/risky/d} is a
PE> +directory whose parent @file{/tmp/risky} is world-writable and is
PE> +not sticky, the command @samp{cp passwd /tmp/risky/d} fails with
PE> +a diagnostic reporting a vulnerable target directory, as an attacker
PE> +could replace @file{/tmp/risky/d} by a symbolic link to a victim
PE> +directory while @command{cp} is running.  In this example, you can
PE> +suppress the heuristic by issuing one of the following shell commands
PE> +instead:

Instead of checking for what *could* go wrong, why not defend more specifically against signs that the attack might be actually happening. Somehow detect, "Uh oh! Parent is writable by another non-root user, and the last component opened through a symlink!", while carefully guarding against race conditions that could render such a defense tactic less than fully effective.
RE: How to submit my utility for inclusion in coreutils?
On 02.09.2017 07:29, Quiroz, Hector wrote: Thanks for your reply. I will write it in c then I will ask again. Thanks You would be doing something pretty silly: taking a working Perl script which performs adequately and rewriting it in C just so that it can potentially be included in some project that requires everything to be written in C. It's coding to solve a political/ideological issue, not a technical one.
Re: New utility suggestion: chdir(1)
On 26.08.2017 11:10, Colin Watson wrote:
I would like there to be an adverbial version of "cd", which takes a path followed by a command and optional arguments and executes the command with its working directory set to the given path. Its invocation would be similar to chroot(8), that is: chdir [OPTION] NEWDIR [COMMAND [ARG]...]

Could be an option in "env":

  env -C /path/to/dir VAR=value ... command arg

(-C follows "tar -C" and "make -C".)
Re: New utility suggestion: chdir(1)
On 26.08.2017 11:10, Colin Watson wrote:

  sudo chroot /path/to/chroot sh -c 'cd /foo && ls -l'

This means dealing with shell quoting, which is tedious and error-prone.

The -c option is not the only way to pass a script to the shell. You can also pipe it in:

  sh <<'end'
  echo 'hello, world'
  end
Re: sort -V behaviour
On 31.07.2017 09:23, Sven C. Dack wrote:
Hello, I have a question about the -V option to sort, but first some examples:

  $ echo -e "1\n1.2\n1.2.3\n1.2.3.4" | sort -V
  1
  1.2
  1.2.3
  1.2.3.4

  $ echo -e "f1\nf1.2\nf1.2.3\nf1.2.3.4" | sort -V
  f1
  f1.2
  f1.2.3
  f1.2.3.4

  $ echo -e "/1\n/1.2\n/1.2.3\n/1.2.3.4" | sort -V
  /1
  /1.2
  /1.2.3
  /1.2.3.4

  $ echo -e "1f\n1.2f\n1.2.3f\n1.2.3.4f" | sort -V
  1f
  1.2f
  1.2.3f
  1.2.3.4f

Note that this last one also has a problem, though the behavior is what you expect, so you don't notice. Here, only the last three lines of input contain version numbers. In each one, the last dot and everything after it is considered a suffix; the versions being sorted are "1", "1.2" and "1.2.3".

  $ echo -e "1/\n1.2/\n1.2.3/\n1.2.3.4/" | sort -V
  1.2.3.4/
  1.2.3/
  1.2/
  1/

My question is, why does the -V option reverse the order in the last case?

From the info documentation:

  Version-sorted strings are compared such that if VER1 and VER2
  are version numbers and PREFIX and SUFFIX (SUFFIX matching the
  regular expression `(\.[A-Za-z~][A-Za-z0-9~]*)*') are strings
  then VER1 < VER2 implies that the name composed of
  "PREFIX VER1 SUFFIX" sorts before "PREFIX VER2 SUFFIX".

Looks like the SUFFIX regex doesn't match, so these names are not treated as version names. It doesn't match because of the trailing slash. If the trailing slash were included in the suffix match, there would still be the problem that .4/, .3/ and .2/ are the suffixes, and the versions being sorted are "1.2.3", "1.2", and "1", with the last "1" being a non-version-number input.

Also this is noted:

  This functionality is implemented using gnulib's `filevercmp'
  function, which has some caveats worth noting. [...]

  * Some suffixes will not be matched by the regular expression
    mentioned above.  Consequently these examples may not sort as
    you expect:

      abc-1.2.3.4.7z
      abc-1.2.3.7z

      abc-1.2.3.4.x86_64.rpm
      abc-1.2.3.x86_64.rpm

Oops! And as you can see from these examples, it is tricky. Sometimes suffixes contain numeric stuff, which is why it's specified that way. Here the .7z files don't match the requirement for treatment as version numbers, because the suffix, following the required period, must begin with a letter or tilde.

This behaviour is unintuitive and seems wrong to me.

I agree that the specification is not ideal, but it's not easy to see how it can be improved given the threat of numeric junk like 7z which cannot be treated as part of the version. Consider that 1.7z looks like a bigger version than 1.2.7z, if the 7 is wrongly treated as part of the version!!! The designers who specified the filevercmp function were clearly alert to these cases.
Re: How the ls command interpret this?
On 30.07.2017 13:46, Reuti wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Is that necessary? What's the use of reducing your plausible deniability of rubbish postings?

On 2017-07-03 20:30, BaBa wrote:
On 2017-07-03 19:17, Eric Blake wrote:

Case 3:

  $ mkdir foo
  $ cd foo
  $ touch a[b]   # the glob doesn't match, so it is passed unchanged
  $ ls ?[b]      # the glob doesn't match, so it is passed unchanged
  ls: cannot access '?[b]': No such file or directory
  $ cd ..
  $ rm -rf foo

Yes understood, the glob didn't match. And if globbing fails, it won't try it without.

Try what without what? If globbing fails, then what happens is that the unexpanded glob pattern remains. (That's the POSIX shell behavior; GNU Bash has a "nullglob" flag which causes non-matching globs to expand to nothing, and the GNU C Library's glob() function has a similar flag.)

I mean, it could use ?'[b]' and try again which would succeed.

Why would globbing re-try with random permutations of the pattern syntax? (Why not also '?'[b]? Why not ?'['b]?) Such a ridiculous complication of its specification of behavior wouldn't help anyone; only lay traps for the unwary script writer.
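For the record, the bash nullglob behavior mentioned above looks like this (a sketch; "end-marker" is just an arbitrary literal word):

  $ shopt -s nullglob
  $ echo ?[b] end-marker
  end-marker

The non-matching pattern vanished entirely, instead of being passed through unexpanded.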
Re: coreutils feature requests?
On 19.07.2017 10:30, Eric Blake wrote:
On 07/19/2017 12:03 PM, Lance E Sloan wrote:
Hi, Eric. Thank you for the thoughtful response. I regret that I have trouble understanding your point of view, though. Please know that I do not mean any disrespect. I'd appreciate it if you could explain why you're opposed to adding new features to cut (or to comm).

I'm not opposed to well-justified new features. It's just that the bar for justifying new features is rather high (it's a lot of code to add to get field reordering, and that code has to be tested; it is also a question of how many users will rely on that extension. A testsuite addition is mandatory as part of adding the new feature, if the new feature is worth adding).

It is nontrivial code. For instance, if we look at how the function cut_bytes works in the implementation, what it's doing is simply a getchar() from the stream, plus a query into a data structure to determine whether the byte should be printed or not. (That data structure consists of a pointer which marches through field range descriptors in parallel with going through the data.) cut_fields is more complicated due to the delimiting of fields, but essentially the same overall approach.

Basically, printing of fields that isn't sorted and de-duplicated is a rewrite of all parts of the utility other than command line processing and the printing of usage help text. It's like two different programs in one, sharing a minimal skeleton.
Re: coreutils feature requests?
On 19.07.2017 10:03, Lance E Sloan wrote:
With regard to your objection to a special environment variable: I agree. I didn't feel strongly about it at first, but I was leaning towards not implementing env. var. support for this. It just didn't feel right. I have written programs that use env. vars. to specify or override default options. However, your point about adding this to an existing, established program where it could possibly cause variable conflicts is enough to set my opinion a little more strongly against an env. var. for this purpose.

My own objection to such influencing env-vars is rooted in "global variables are bad"; i.e. the inherited wisdom from decades of computer science and software engineering practice.

Say you want to combine code from two or more scripts, all of which want to configure the same tools in different ways via globals. Now you have a mess of saving and restoring the globals so that they have the values that each bit of code expects when that code is running.

With magic globals we can't look at a line of code and know exactly what it will do, without reasoning about what the current values of the globals are. That reasoning may be intractable: there are situations in which it simply cannot be known. Then the code has to ensure that the variables have certain values by assigning them. Except, it can't just assign, because that will influence some other code; it has to save the prior values, then assign, then restore when done.

Of course environment variables can be scoped:

  OPT=value command

but that's local syntax now, which defeats the purpose of the "action at a distance" effect of the variable; it might as well be transliterated to:

  command --opt=value

and there we are.
RE: coreutils feature requests?
On 19.07.2017 06:29, Nellis, Kenneth wrote:
From: Steeve McCauley
I can't believe I'd never thought of reordering output columns like this.

FWIW, I agree that another option should be used to prevent issues with backward compatibility.

  $ echo 1,2,3,4,5,6 | cut -d, -f3,5,2
  2,3,5
  $ echo 1,2,3,4,5,6 | cut -d, -f3,5,2 -o
  3,5,2

Should this be extended to character output as well?

  $ echo output | cut -c6,4,2 -o
  tpu

Absolutely! It would be expected behavior (IMHO). I see no reason not to.

POSIX expends considerable text in requiring that the fields constitute a set, and are de-duplicated and put into order. (Quoted in an earlier message.) So indeed, the behavior cannot just be changed to match the QNX "cut", not just because of backward compatibility (always the primary concern) but standard conformance also (a close second). QNX has a conformance bug here. I wouldn't continue to rely on it.

In addition, so that scripts can work across platforms, I (strongly) recommend that a cut-specific environment variable be defined to allow specifying the field ordering behavior. In that way my QNX 4 script (whose cut would balk at the -o option) would work with Gnu. One possibility: [...]

On the other hand, I hope I'm not alone in being opposed to introducing new global variables which alter language or tool semantics. Scripts can target multiple implementations of a command by defining a wrapping function rather than a magic global variable. A portable script cannot rely on cut having order-preserving fields anyway. Everyone in POSIX-land implementing different features and then using a zoo of environment variables to emulate each other's features and quirks would be unmanageable.
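Such a wrapping function might look like this (a sketch; the name "cutter" is made up, and it handles only simple comma-separated field lists, not ranges):

  # usage: cutter DELIM FIELD_LIST
  # prints the requested fields in the requested order
  cutter() {
      awk -F"$1" -v OFS="$1" -v list="$2" '
          BEGIN { n = split(list, f, ",") }
          {
              out = $(f[1])
              for (i = 2; i <= n; i++)
                  out = out OFS $(f[i])
              print out
          }'
  }

  echo 1,2,3,4,5,6 | cutter , 3,5,2    # prints 3,5,2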
Re: coreutils feature requests?
On 18.07.2017 15:44, Lance E Sloan wrote:
Hi, all. Aside from a bug report, is there a way to submit a feature request for coreutils? I have a couple of requests in mind, which I don't think qualify as bugs: 1. Add a feature to "cut" that allows selected fields to be output in a different order. That is, "cut -f4,1,8-6" would cause it to output fields in the order of 4, 1, 8, 7, and 6.

I'm amazed that it doesn't work this way; the utility of implicitly sorting the fields appears low compared to the damage that it does to the flexibility of cut. (What little it has!) If POSIX specifies it, I have to say that its requirements are suboptimal (as is often the case in diverse areas). Indeed, the requirement is sadly given as:

  The elements in list can be repeated, can overlap, and can be
  specified in any order, but the bytes, characters, or fields
  selected shall be written in the order of the input data.  If an
  element appears in the selection list more than once, it shall
  be written exactly once.

Do people often write cut specifications in ad hoc orders, with repetitions, and then rely on the sorting behavior? The GNU implementation of cut could lead standardization in this area to improve things.

All the possible solutions for making cut not sort the fields or de-duplicate them are ugly. Either you need a global option to opt out of that behavior, like "-o" (preserve (o)rder), and keep remembering to use it, or else provide order-preserving versions of the various options, like perhaps through capitalization: -F, -C, -B. An attractive alternative is to have a whole new command which mirrors cut, like say "clip", which is exactly like cut, but order and repetition preserving.

Note that the --complement option is semantically incompatible with order-preserving mode; the complement concept follows from the selected elements being regarded as a set rather than an ordered sequence. If -F, -C or -B is used with --complement, it has to be diagnosed. Or if there is a "clip" command, then that simply doesn't support --complement.
Re: Determination of file lists from selected folders without returning directory names
On 18.07.2017 01:17, SF Markus Elfring wrote: I imagine that there are more advanced possibilities to improve the software run time characteristics for this use case. Well, if it has to be fast, perhaps don't write the code in the shell language. To which “shell” would you like to refer? The "Shell Command Language", called by that name in POSIX, and to its dialects. Even an interpreted scripting language that can do string handling without resorting to fork()-based command substitution will beat the shell at many tasks. What do you think about additional approaches to reduce the forking of special processes? I think: don't do text processing whose speed matters in a language where you have to even think about the issue "how do I reduce fork() occurrences in string processing code" and in which you don't even know whether some command involves a fork or is a built-in primitive. If you've resigned yourself to developing something in the shell, and that something has to process many items of data, try not to write a shell loop for the task, and try to avoid idioms which run a process for each item. Rather, coordinate commands which do the heavy lifting. If I had to strip a large number of paths to their basenames, and it had to be done in portable shell code, I would filter those names through sed: one process invocation and some pipe inter-process I/O (see the sketch at the end of this message). I.e. we can use the basename function: for name in dir/*txt; do basename "$name" done prints the basenames of the matching files, one per line. There is also the GNU variant available for such a command. for X in $(basename --suffix=.txt dir/*txt); do my_work $X; done What you're doing here is destroying the validity of these expanded paths; the "my_work" command or function cannot access things through these paths, unless it restores the "dir/" prefix, which it has not been given as an input. When you expand dir/*txt, each one of the expansions is a correct relative path to an object. The stripped basenames aren't. Whatever "my_work" is doing, if it involves accessing the files, you're probably making its job more difficult. But how often can it be avoided to delete extra data like prefixes (and suffixes)? Pretty much all of the time. Can it occasionally be a bit more efficient to provide only the essential values at the beginning of an algorithm so that they will be extended on demand? That sounds like a generic description of the whole body of "lazy" or "late binding" techniques; but it's unclear how it is supposed to apply here. Maybe "my_work" could be given relative paths that resolve; if it needs shortened names for some reason, let it compute them. Or "my_work" could be given a quoted pattern: my_work '*.txt' then it can expand it as needed, in whatever directory it wants.
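A minimal sketch of that sed approach, assuming the file names contain no newline characters (printf is a shell built-in, so the whole pipeline costs one sed process):

# strip every matching path to its basename in a single process
printf '%s\n' dir/*txt | sed 's,.*/,,'

One sed invocation handles every name, however many the pattern matches; compare that with one basename process per file in the loop above.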
Re: Determination of file lists from selected folders without returning directory names
On 17.07.2017 12:37, SF Markus Elfring wrote: A corresponding result could be achieved by using a subshell (which I would like to avoid for this use case) for a command like “(cd ${my_dir} && ls *txt)”. If you want to capture these strings into a variable, you can't really avoid a sub-process. I looked at a programming interface like the function “opendir”. I imagine that there are more advanced possibilities to improve the software run time characteristics for this use case. Well, if it has to be fast, perhaps don't write the code in the shell language. Even an interpreted scripting language that can do string handling without resorting to fork()-based command substitution will beat the shell at many tasks. it can be done like this: for name in dir/*txt ; do echo ${name#dir/} done I would like to avoid such an operation “Remove matching prefix pattern” generally. If the desired file lists contain only basenames, extra prefixes do not need to be deleted. I.e. we can use the basename function: for name in dir/*txt; do basename "$name" done prints the basenames of the matching files, one per line.
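If even one process in total is too many, the prefix can be removed with POSIX parameter expansion, entirely inside the shell; a minimal sketch:

# ${name##*/} strips the longest prefix ending in "/" (the directory
# part), so no basename process is forked at all
for name in dir/*txt; do
  printf '%s\n' "${name##*/}"
done

This trades the per-file basename process for built-in expansion, though it is, of course, exactly the kind of “remove matching prefix pattern” operation being discussed.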
Re: Determination of file lists from selected folders without returning directory names
On 17.07.2017 10:25, SF Markus Elfring wrote: Hello, The tool “ls” supports the filtering of directory contents. No, it actually doesn't. In a shell command like ls *.txt it is actually the shell which performs the *.txt filename expansion, before the ls program is executed. That program receives all of the file names as individual arguments, rather than the original pattern. If you just want the names themselves, you can simply do: echo *.txt One advantage of this is that it avoids the limitations on the size of the argument vector that can be passed to a child process, because echo is built-in (if we're talking about GNU Bash, which is reasonable, given that we're on the GNU Coreutils list). Iterating over the matching names is possible using the for syntax: for x in *.txt ; do commands ... done That also avoids limitations on argument passing since it is all built-in syntax. A corresponding result could be achieved by using a subshell (which I would like to avoid for this use case) for a command like “(cd ${my_dir} && ls *txt)”. If you want to capture these strings into a variable, you can't really avoid a sub-process. For instance, simply doing: names=$(echo *txt) involves a sub-process for the command substitution. Since it is echo, it could be optimized. I just checked though; bash 4.3.48 is forking a child process for this. So there is hardly much additional disadvantage from doing: names=$(cd dir; echo *txt) that just adds a chdir() system call to the child process. If the goal is to just dump the names on standard output without the directory prefix, it can be done like this: for name in dir/*txt ; do echo ${name#dir/} done This involves no forking of a child process. To get them on the same line as with echo: for name in dir/*txt ; do printf "%s " "$name" done echo # emit final newline
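One shell-specific aside worth recording (bash and similar shells, not portable sh): an array assignment captures the expanded names with no command substitution, and hence no fork; a minimal sketch:

# bash: capture the matching names into an array; no child process
names=( dir/*txt )
printf '%s\n' "${names[@]}"

In strictly portable shell, though, the command substitution, and the fork it entails, is hard to avoid, as noted above.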
Re: null separated ls output option
On 28.06.2017 23:31, Bernhard Voelker wrote: [adding findutils] First of all, find(1) is maintained in the GNU findutils project, not in GNU coreutils. Redirected from: http://lists.gnu.org/archive/html/coreutils/2017-06/msg00049.html On 06/28/2017 07:13 PM, Kaz Kylheku (Coreutils) wrote: > [ snip ... my elaborate proposal for a find -sort predicate ] I think the GNU toolbox already gives you enough flexibility to support these edgy use cases, e.g. sorting by file size: find . -printf "%s/%p\0" \ | sort -zt '/' -k1,1n \ | sed -z 's,^[^/]*/,,' That is great. So there is nothing to do here, basically; all the better. No null-terminated patch for ls is required; GNU sort can sort the specially formatted null-terminated output from find. See? It pays to have a discussion about the requirements before whipping up code.
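Since the original request was sorting by date, the same pattern works with a time key; a sketch assuming GNU find's %T@ format (seconds since the epoch, with a fractional part):

# newest first: numeric reverse sort on the mtime key, then strip the key
find . -printf '%T@/%p\0' \
  | sort -zt '/' -k1,1nr \
  | sed -z 's,^[^/]*/,,' \
  | tr '\0' '\n'   # only for display; omit to keep NUL termination

The [^/]* in the sed expression matters: a greedy .*/ would also eat the directory components of %p, not just the prepended key.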
Re: null separated ls output option
On 28.06.2017 06:53, ra...@openmailbox.org wrote: On 2017-06-01 04:45, Pádraig Brady wrote: On 31/05/17 15:24, ra...@openmailbox.org wrote: Existing tools like find(1) were thought sufficient but find does not support sorting by date which ls does. I hope this patch can be reconsidered for inclusion. Rather than the obvious: patching find so that it supports sorting the paths by date, or other properties? This could be a "-sort <key>" predicate, which is understood to be like "-print", except it sends the visited node into a sorting bucket, which is spilled when find finishes executing. Sort buckets are identified by the "<key>" syntax as a key, so multiple occurrences of the predicate giving the same <key> go to the same bucket. Multiple occurrences of -sort with different keys route to different buckets; these buckets can be later dumped in left to right order, based on the position of the leftmost predicate which specifies each bucket. <key> could use + and - as prefixes for ascending and descending (defaulting to + if omitted), followed by a word which is derived from the space of predicates: atime, ctime, name, iname, ... Comma separation for compound keys? -sort mtime,name Something like that.
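A hypothetical transcript of how such a predicate might read in practice; none of this exists in findutils, the syntax is just the one proposed above:

# ascending mtime, ties broken by name:
find . -type f -sort mtime,name
# newest first:
find . -type f -sort -mtime
# two buckets, dumped in the left-to-right order of their first mention:
find . \( -name '*.c' -sort name \) -o \( -name '*.h' -sort size \)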
Re: [PATCH] env: support encoding of args into command.
On 29.05.2017 04:29, Eric Blake wrote: On 05/27/2017 07:30 PM, Kaz Kylheku (Coreutils) wrote: Basically I'm completely against almost every aspect of this -S design; and I suspect the POSIX standardization people (Austin Group) won't adopt it, either, so it will forever remain just a FreeBSD feature (and we can help keep it that way by not copying it). The Austin Group has already declared that #! is non-portable, and that portable scripts can't use it, BECAUSE of the wide variety in how kernels handle it and the small limits on how much you can cram in that line. Gentlemen, please disregard the patch. I don't care about it any more because I have discovered a hack which makes it pointless. With excellent language-level backward compatibility, a given scripting language interpreter "interp" can provide support for being invoked in the following manner: #!/usr/bin/env interp\000trailing material Here \000 represents a literal embedded null byte. So, of course, env receives argv[1] as "interp" and finds the interpreter properly. This is the case whether the kernel stops reading the string after the null, or whether the kernel passes the character array "interp\000trailing material" as argv[1] to env. Either way, env only sees "interp". The interpreter can then open the script and read the full line, look for the null byte, and give a meaning to "trailing material". The interpreter can, in that space, implement the equivalent of my argument delimiting approach, or the more elaborate one taken in BSD's env -S. The notation is very space efficient: just one delimiting character which positively requires no escaping. It doesn't require adding a second line to the script for encoding the material, which can change the meaning of existing scripts. It also potentially defeats limitations on hash bang line size. Why? Because the only requirement which has to be met is that the null byte occurs within the header size limit! Not the entire hash bang line. The programmer is not relying on the hash bang mechanism to pass anything after the null byte through the command line, so if any of it is cut off, that is immaterial. So far, I have tried this on Darwin, Linux, Solaris and Cygwin: works fine! A possible objection is that every interpreter has to implement its own hack for recognizing the material after the null byte and doing something. The solution for that, of course, is to provide a library function for dealing with it: a function which takes (argc, argv), and the index of which argv[] element is the script name, and returns a transformed (argc, argv). The thing to do is to develop that library function to make it easy for interpreter writers to just "drop in".
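A small sketch of constructing such a script and checking that the null byte survives; "interp" is a stand-in name, and bash's printf is assumed (its \0 escape emits a NUL byte):

# build a script whose hash bang embeds a NUL after the interpreter name
printf '#!/usr/bin/env interp\0--strict --verbose\n' > demo
printf 'script body goes here\n' >> demo
chmod +x demo
# inspect the first line; od -c shows the \0 between "interp" and the rest
head -n 1 demo | od -c

Executing ./demo then behaves as described: env sees only "interp", and an interpreter that is in on the trick can re-read the first line of demo and pick up "--strict --verbose" after the NUL.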
Re: [PATCH] env: support encoding of args into command.
On 27.05.2017 08:06, Pádraig Brady wrote: Now the FreeBSD env(1) command supports the -S option to do splitting much as you've proposed, so I see no need to be different. Could you adjust the patch accordingly? I think it's probably best to avoid copying FreeBSD here. They have created a mess with too many requirements. Their -S feature has whitespace delimiting, quoting, escaping of quotes, C-like character escape sequences: \n, \r, \f, ... and ${PARAM} environment variable expansion. A little shell language seems to be brewing inside the FreeBSD env program. What's it all for? Yet, for all the "bells and whistles", they are missing the {} feature to insert the first argument (the name of the script in hash bang dispatch) among the generated arguments, something which can't be achieved with any combination of environment variable substitutions. It is useful because it allows the hash bang to specify some arguments after the script (which could be arguments belonging to the script rather than to the interpreter, for instance). Let's look at the delimiting requirement. Using spaces perpetrates a clever visual ruse. The command line: $ /usr/bin/env -S a b c translates directly to hash bang: #!/usr/bin/env -S a b c the semantics are different, of course; here the string " a b c" is one argument to -S, subject to splitting. It's understandable why they have it this way, but I believe it is fine not to play this sort of trick and make the mechanism have a visually explicit syntax. I chose the colon character for a very good reason: it is the PATH separator. Which means, it doesn't occur in the command argument in any correct usage of env, and is therefore available as a separator. I don't envision a need to support quoting the : character for inclusion in an argument. The main reason to include : in an argument is to support the -P altpath feature of env, another FreeBSD extension. This is also overdesigned. If you don't know where the program is, rely on PATH. If you know where the program is, put its exact path into the hash bang script, and don't use the env utility. The situation "I don't know where the software might be installed, but it's in one of these several locations which are not in the PATH" is fairly contrived, except in one way: when it is known that the software is in one of the secure, default system locations for programs such as /bin:/usr/bin, but the PATH could have been altered not to look in these locations first. Instead of -P altpath, a single-letter option with no argument can indicate that PATH is to be reset to the secure default, so that the correct program is found, or else env fails. Basically I'm completely against almost every aspect of this -S design; and I suspect the POSIX standardization people (Austin Group) won't adopt it, either, so it will forever remain just a FreeBSD feature (and we can help keep it that way by not copying it).
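For concreteness, the two notations side by side in a hash bang line; the first reflects FreeBSD's documented -S whitespace splitting, the second the colon notation proposed in this thread:

#!/usr/bin/env -S interp --strict-iso
#!/usr/bin/env :interp:--strict-iso

Both should end up executing interp with the argument --strict-iso followed by the script path; the colon form needs no quoting or escaping rules, precisely because : cannot appear in a correct command argument.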
Re: [PATCH] env: support encoding of args into command.
On 24.05.2017 18:10, Kaz Kylheku (Coreutils) wrote: This is a new feature which allows the command argument of env to encode multiple extra arguments, as well as the relocation of the first trailing argument among those arguments. Looks like my MUA screwed this up with "format=flowed" and quoted printable. I will re-send using the "mail" utility.
[PATCH] env: support encoding of args into command.
This is a new feature which allows the command argument of env to encode multiple extra arguments, as well as the relocation of the first trailing argument among those arguments. * src/env.c (usage): Mention the existence of the feature. (expand_command_notation): New function. (main): Detect whether the notation is present, based on the first character of command. If so, filter the trailing part of the argument vector through the expand_command_notation function, and use that. Either way, the effective vector is referenced using the down_argv variable and that is used for the execvp call. If an error occurs, the diagnostic refers to the first element of down_argv rather than the original argv. * tests/misc/env.sh: Added some test cases. Doesn't probe all the corner cases. I solemnly declare that I manually tested those corner cases, like "env :" and "env :{}" and such, and used valgrind for all the manual testing to be confident that there are no overruns or uses of uninitialized bytes. * doc/coreutils.texi: Documented feature. Added discussion about how env is often used for the hash bang mechanism, and how the feature relates to this use. --- doc/coreutils.texi | 63 + src/env.c | 64 -- tests/misc/env.sh | 18 +++ 3 files changed, 143 insertions(+), 2 deletions(-) diff --git a/doc/coreutils.texi b/doc/coreutils.texi index 1834e92..9e1cb0c 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -16879,6 +16879,69 @@ env -u EDITOR PATH=/energy -- e=mc2 bar baz @end itemize +Note that the ability to run commands in a modified environment is built into +the shell language, using a very similar @samp{@var{variable}=@var{value}} +syntax; moreover, that syntax allows commands internal to the shell to be run +in a modified environment, which is not possible with the external +@command{env}. Other scripting languages usually also have their own built-in +mechanisms for manipulating the environment around the execution of a child +program. Therefore the external @command{env} executable is rarely needed for +the purpose of running a command in a modified environment. Because the +@command{env} utility uses @env{PATH} to search for @var{command}, it has come +to be mainly used as a mechanism in "hash bang" scripting. In this usage, +scripts are written using the incantation @samp{#!/usr/bin/env interp} where +@var{interp} is the name of some scripting language interpreter. The +@command{env} utility provides value by searching @env{PATH} for the location +of the interpreter executable. This allows the interpreter to be installed in +some chosen location, without that location having to be edited into the hash +bang scripts which refer to that interpreter. + +On some operating systems, the following issue exists: the hash bang +interpreter mechanism allows only one argument. Therefore, if the @command{env} +incantation @samp{#!/usr/bin/env interp} is used, it is not possible to pass an +argument to @samp{interp}, which is a crippling limitation in some +circumstances requiring clumsy workarounds. To overcome this difficulty, the +GNU Coreutils version of @command{env} supports a special notation: +arguments for @var{command} can be embedded in the @var{command} argument +itself as follows. If @var{command} begins with the @samp{:} (colon) +character, then that colon character is removed. The remainder of the +argument is treated as a record of colon-separated fields, and split +accordingly.
For instance if @var{command} is @samp{:foo:--bar:42}, then +it is split into the fields @samp{foo}, @samp{--bar} and @samp{42}. The +effective command is then just @samp{foo}. The other two fields will be +passed as the first two arguments to @samp{foo}, inserted before the +remaining @var{args}, if @samp{foo} is successfully found using +@env{PATH} and executed. +Furthermore, this special notation supports one more refinement. +If, after colon splitting, one or more of the fields are +equal to the character string @samp{@{@}} (open brace, closed brace) +then the leftmost such field is replaced with the first of the @var{args} +which follow @var{command}. In this case, that argument is removed from +@var{args}. If @var{args} is empty, then the field is not replaced. + +Example: @command{env} hash bang line for a script executed by the +fictitious @samp{intercal} interpreter. The @samp{--strict-iso} option +is passed to the interpreter, and the @samp{--verbose} option is +passed to the script: + +@example +#!/usr/bin/env :intercal:--strict-iso:@{@}:--verbose +... script goes here ... +@end example + +When the above hash bang script is invoked with the arguments @samp{alpha} and +@samp{omega}, @command{env} is invoked with four arguments: the +argument @samp{:intercal:--strict-iso:@{@}:--verbose}, followed by the +path name to the above script itself, followed by @samp{alpha} and @samp{omega}.
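A hypothetical transcript of the proposed splitting, shown on the command line rather than in a hash bang so the field boundaries are easy to see (the colon notation exists only in this patch, not in released coreutils):

$ env ':printf:%s\n:one:two'
one
two
$ env ':echo:{}:last' first second
first last second

In the second command, the leftmost {} field absorbs "first" (the first trailing argument), and the remaining trailing argument "second" follows the generated ones, exactly as the @{@} rule above describes.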