bug#33371: RFC: option for numeric sort: ignore-non-numeric characters
tags 33371 notabug
close 33371
stop

Hello,

On 2018-11-18 6:08 p.m., L A Walsh wrote:
> On 11/14/2018 12:27 AM, Erik Auerswald wrote:
>> Perhaps --version-sort could work for you?
>
> "-V" seems like it might be sufficient,

Given the above, I'm closing this item.

regards,
 - assaf
bug#33371: RFC: option for numeric sort: ignore-non-numeric characters
On 11/14/2018 12:27 AM, Erik Auerswald wrote:
> Hi,
>
> On Tue, Nov 13, 2018 at 06:32:55PM -0800, L A Walsh wrote:
>> I have a bunch of files numbered from 1 to over 2000 without leading
>> zeros (think rfc's)...
>> They have names with a non-numeric prefix & suffix around the number.
>
> Are prefix and suffix constant? RFC files are usually named rfc${NR}.txt.
>
>> It would be nice if sort had the option to ignore non-numeric
>> data and only sort on the numeric data in the 'lines'/'files'.
>
> Perhaps --version-sort could work for you?
>
> $ for r in rfc{1..100}.txt; do echo "$r"; done | sort | sort -V
>
> (The first sort un-sorts the sorted input data, the second sorts it
> again.)

- Tried this... had an initial "turn-off" at using a for loop to list files when '/bin/ls -1 *.txt' was so much shorter. However, 'sort -V' by itself works. I'm not sure exactly why, but that wasn't initially clear to me, though maybe it should have been, having written a version sort more than once before.

Minor gotchas: using single-digit numbers, the for loop produced:

rfc1.txt
rfc2.txt
rfc3.txt
rfc4.txt
rfc5.txt
rfc6.txt
rfc7.txt
rfc8.txt
rfc9.txt

while the '/bin/ls -1 rfc?.txt|sort -V' pipeline produced:

rfc1.txt
rfc2.txt
rfc3.txt
rfc4.txt
rfc5.txt
rfc6.txt

(7-9 didn't exist in the directory)

>> [...]
>> Or is there an option for this already, and my manpage out of date?
>
> AFAIK not exactly.
>
> Thanks,
> Erik

"-V" seems like it might be sufficient, but I doubt most non-computer types would know that -V would sort multiple numeric fields separated by invariant non-numeric characters in a numeric fashion (or would even know how a version sort differs from the other sorts).

Given how well read docs are these days, we almost need a literal definition of 'version sort' besides just calling it a 'version sort' (which, we must admit, is 'jargon'). Along the lines of:

--version-sort | -V
    Sees inputs as a mix of numeric and alphabetic (or "identifier")
    fields, where the numeric fields are sorted naturally, and alpha
    fields sorted alphabetically. Fields may have separators like '.',
    '_', or '-', sometimes constrained by a specific computer language,
    or may have no separator at all between numeric and alpha fields.
    This type of sort is often called a "version sort" in the computer
    field.

???

I listed 'version sort' at the end, as the equivalence, so those who tend to skip and read only the initial parts of lines/paragraphs would not just see "version sort" and gloss over the rest, inserting their own equivalence for the definition -- especially likely w/"version-sort" being the long form of the switch.
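To make the proposed wording concrete, here is a small sketch of how names with several numeric fields come out under a plain sort versus 'sort -V'. The 'pkg-*' names are made up for illustration and do not come from this thread; -V is a GNU extension, and LC_ALL=C is set only so the plain-sort result does not depend on the locale.

```shell
# Hypothetical sample names with two numeric fields separated by '.'.
names='pkg-1.10
pkg-1.9
pkg-2.0'

# Plain sort compares byte by byte, so "1.10" lands before "1.9".
plain=$(printf '%s\n' "$names" | LC_ALL=C sort)

# Version sort compares each run of digits as a number, so 9 < 10.
versioned=$(printf '%s\n' "$names" | LC_ALL=C sort -V)

printf 'plain sort:\n%s\n\nversion sort:\n%s\n' "$plain" "$versioned"
```

The only difference between the two results is the position of pkg-1.9, which is the "numeric fields are sorted naturally" part of the suggested definition.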
bug#33371: RFC: option for numeric sort: ignore-non-numeric characters
On 11/13/2018 6:44 PM, Eric Blake wrote:
> On 11/13/18 8:32 PM, L A Walsh wrote:
>> I have a bunch of files numbered from 1 to over 2000 without leading
>> zeros (think rfc's)...
>> They have names with a non-numeric prefix & suffix around the number.
>>
>> It would be nice if sort had the option to ignore non-numeric
>> data and only sort on the numeric data in the 'lines'/'files'.
>>
>> Yeah, I can renumber and rename them all, but I just wanted
>> an instant command that could sort numeric values even if embedded
>> in a line, where the "field" was determined by the start/stop of
>> numeric characters.
>>
>> Or is there an option for this already, and my manpage out of date?
>
> Without ACTUAL data to experiment with, it's much harder for anyone else
> to propose a solution that will work with your specific data.

...think rfcs... um, have you ever looked at a directory with a bunch (all or most) of the RFCs in it?

> But one quick approach comes to mind: decorate-sort-undecorate:
>
> sed 's/^\([^0-9]*\)\([0-9]*\)/\2 \1\2/' < myinput \
>   | sort -k1,1n | sed 's/^[0-9]* //' > myoutput

That does work, but it still seems a bit odd for a numeric sort not to ignore, even by default, non-numeric data in front of or after the number. I may be imagining this, but I thought I'd seen some version of sort that did this -- simply skipping the non-numeric characters and sorting on the numbers. Instead this sort reverted to alpha sort.

Thinking about it... if I ask for numeric sort, shouldn't it at least try to look for numbers in each line to sort them? That seems like it might be a user-friendly and even consistent thing to do, considering there are options to

1) ignore leading blanks
2) ignore case
3) ignore nonprinting... (this most closely parallels the request, since when doing an alpha sort, one might hope it could ignore what isn't visible).
4) "human sort" --- actually this option sorta makes it look like a bug, since this sort ignores things that don't look like a number+suffix.

So why wouldn't numeric sort do the same?

I'd even sorta hoped the -h sort might work for this... since if you were showing sizes, and only had values in 'bytes', you wouldn't see the suffixes. So I'd hoped that it would order rfc98.txt before rfc979.txt, but such is not the case.

I.e. in the case of 'ls', it ignores junk before and after the number (plus optional unit). So one might wonder why it doesn't properly sort the numbers with 'rfc' before them and '.txt' after them. I.e. should 4 have worked maybe? Might be a bit perverse, but can't see why not.
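The behaviour described above can be reproduced with the two names mentioned. This is only a sketch, assuming GNU sort (-h and -V are GNU extensions); LC_ALL=C pins down the fallback comparison. With -n and -h, neither line starts with a number, so the keys compare equal and sort falls back to its last-resort comparison of the whole line, which is the "reverted to alpha sort" effect.

```shell
# The two file names from the message above.
names='rfc98.txt
rfc979.txt'

# -n: no leading number on either line, so the keys tie and the
# whole-line byte comparison decides ("rfc97..." < "rfc98...").
numeric=$(printf '%s\n' "$names" | LC_ALL=C sort -n)

# -h: same situation, the lines do not start with a number+suffix.
human=$(printf '%s\n' "$names" | LC_ALL=C sort -h)

# -V: the digit run inside the name is compared as a number.
version=$(printf '%s\n' "$names" | LC_ALL=C sort -V)

printf -- '-n:\n%s\n\n-h:\n%s\n\n-V:\n%s\n' "$numeric" "$human" "$version"
```

Only the -V result puts rfc98.txt before rfc979.txt.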
bug#33371: RFC: option for numeric sort: ignore-non-numeric characters
Hi,

On Tue, Nov 13, 2018 at 06:32:55PM -0800, L A Walsh wrote:
> I have a bunch of files numbered from 1 to over 2000 without leading zeros
> (think rfc's)...
> They have names with a non-numeric prefix & suffix around the number.

Are prefix and suffix constant? RFC files are usually named rfc${NR}.txt.

> It would be nice if sort had the option to ignore non-numeric
> data and only sort on the numeric data in the 'lines'/'files'.

Perhaps --version-sort could work for you?

$ for r in rfc{1..100}.txt; do echo "$r"; done | sort | sort -V

(The first sort un-sorts the sorted input data, the second sorts it again.)

> [...]
> Or is there an option for this already, and my manpage out of date?

AFAIK not exactly.

Thanks,
Erik
-- 
It's impossible to learn very much by simply sitting in a lecture, or
even by simply doing problems that are assigned.
                -- Richard P. Feynman
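A smaller variant of the pipeline above, written so the intermediate results can be inspected. It is a sketch under a few assumptions: seq and sed stand in for the bash brace expansion so it also runs under plain sh, the range is cut to 12 to keep the output short, and LC_ALL=C is set so the lexical step is predictable.

```shell
# Generate rfc1.txt .. rfc12.txt in numeric order.
names=$(seq 1 12 | sed 's/.*/rfc&.txt/')

# A plain sort puts them in lexical order (rfc1, rfc10, rfc11, ...),
# which is the "un-sorting" step of the original pipeline.
lexical=$(printf '%s\n' "$names" | LC_ALL=C sort)

# sort -V restores the numeric order.
resorted=$(printf '%s\n' "$lexical" | LC_ALL=C sort -V)

printf 'lexical order:\n%s\n\nafter sort -V:\n%s\n' "$lexical" "$resorted"
```

After the second step the list matches the generated order again.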
bug#33371: RFC: option for numeric sort: ignore-non-numeric characters
On 11/13/18 8:32 PM, L A Walsh wrote:
> I have a bunch of files numbered from 1 to over 2000 without leading
> zeros (think rfc's)...
> They have names with a non-numeric prefix & suffix around the number.
>
> It would be nice if sort had the option to ignore non-numeric
> data and only sort on the numeric data in the 'lines'/'files'.
>
> Yeah, I can renumber and rename them all, but I just wanted
> an instant command that could sort numeric values even if embedded
> in a line, where the "field" was determined by the start/stop of
> numeric characters.
>
> Or is there an option for this already, and my manpage out of date?

Without ACTUAL data to experiment with, it's much harder for anyone else to propose a solution that will work with your specific data. But one quick approach comes to mind: decorate-sort-undecorate:

sed 's/^\([^0-9]*\)\([0-9]*\)/\2 \1\2/' < myinput \
  | sort -k1,1n | sed 's/^[0-9]* //' > myoutput

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
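For anyone who wants to see the three stages separately, the one-liner above can be wrapped in a function and fed a few sample lines. The function name sort_by_first_number and the sample input are made up for this sketch; the sed and sort commands are the ones from the message.

```shell
# Hypothetical helper: sort stdin by the first run of digits on each line.
sort_by_first_number() {
  # Decorate: copy the first digit run to the front, followed by a space.
  sed 's/^\([^0-9]*\)\([0-9]*\)/\2 \1\2/' |
    # Sort: numeric comparison on that leading field only.
    LC_ALL=C sort -k1,1n |
    # Undecorate: strip the leading digits and the space again.
    sed 's/^[0-9]* //'
}

# Sample input, deliberately out of numeric order.
input='rfc979.txt
rfc7.txt
rfc98.txt'

sorted=$(printf '%s\n' "$input" | sort_by_first_number)
printf '%s\n' "$sorted"
```

The decorated form of rfc98.txt is "98 rfc98.txt", so the numeric key never touches the original text and the last sed removes it cleanly.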
bug#33371: RFC: option for numeric sort: ignore-non-numeric characters
I have a bunch of files numbered from 1 to over 2000 without leading zeros (think rfc's)...
They have names with a non-numeric prefix & suffix around the number.

It would be nice if sort had the option to ignore non-numeric data and only sort on the numeric data in the 'lines'/'files'.

Yeah, I can renumber and rename them all, but I just wanted an instant command that could sort numeric values even if embedded in a line, where the "field" was determined by the start/stop of numeric characters.

Or is there an option for this already, and my manpage out of date?

Thx
-l