Hello Lance and all,

On Jul 19, 2017, at 13:03, Lance E Sloan <sloanlance+coreutils_gnu.org@gmail.com> wrote:
> I'd appreciate it if you could explain why you're opposed to adding
> new features to cut (or to comm).

If I may chime in about this: there is a delicate balance between adding more features (and ending up with bloated software) and keeping the program lean (but providing less functionality). Sometimes it boils down to a judgement call by the maintainers. And then there's also the Unix philosophy of having a tool "do one thing and do it well".

> It may help if I explain my point of view.

I think what would help the most is if you can share an actual problem that you have (i.e. the input and desired output); perhaps we can then find a good solution using existing tools.

> My considerations for a solution:
>
> 1. I need this feature to process several files that have millions of
>    lines each. I need to do this on an ongoing, periodic basis. I can't
>    afford for the process to be slow.

Here's a concrete example:

===
$ time wc -l 1.txt
18833902 1.txt

real    0m0.442s
user    0m0.232s
sys     0m0.208s

$ time cut -f1,3,5 1.txt > /dev/null

real    0m2.923s
user    0m2.736s
sys     0m0.188s

$ time mawk '{print $3,$1,$5}' 1.txt > /dev/null

real    0m4.903s
user    0m4.680s
sys     0m0.224s
===

Using existing tools ('mawk' in this case) gives you all of awk's flexibility, at a slight increase in cost. The example file had 18M lines - and we're still talking about just 4 seconds of user time.

> 2. Since I have a large amount of data, I'm avoiding regular
>    expressions and interpreted languages, which take longer to complete
>    the job. That eliminates awk and several other possible solutions.
>    A compiled C application would be best.

I agree that running a regex on every line is slow, but using awk just for the sake of reordering fields does not require any regex.

> 4. Part of my data processing uses jq. I've figured out how to do this
>    field reordering with it, but it makes my jq filter more complex and
>    more difficult for my successors to maintain. As written on
>    https://stedolan.github.io/jq/ , "jq is like sed for JSON data".
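As an aside to the timing comparison above, here is a minimal sketch (the three-field sample line is made up) of the behavioural difference: cut always writes selected fields in input order, while awk's positional variables reorder them with no regex involved. For tab-separated input, setting awk's separators keeps its output formatted like cut's:

```shell
# cut selects fields but always emits them in input order,
# so -f3,1 behaves exactly like -f1,3:
printf 'a\tb\tc\n' | cut -f3,1
# prints: a<TAB>c

# awk reorders via positional variables; no regular expression is used.
# -F sets the input field separator, and OFS=FS makes the output
# tab-separated as well, matching cut's formatting:
printf 'a\tb\tc\n' | awk -F'\t' 'BEGIN{OFS=FS} {print $3, $1}'
# prints: c<TAB>a
```

This is why reordering is the one thing the awk invocation adds over cut: the field splitting itself is the same simple delimiter scan in both tools.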
> I don't consider sed to be a good solution for a problem of this size,
> so jq probably isn't ideal, either.

(Disclaimer: I'm a maintainer of GNU sed, and I've also contributed code to jq.)

This point confuses me a bit: cut, awk, and sed are all line-based tools, meaning your logical "records" are expected to be on one line (at least, that is the most common usage of these tools). jq is JSON-based - it doesn't need records to be contained in a single line at all (though it can work that way with optional arguments). Is your input file "one JSON record per line"? Or are you using 'jq' to read non-JSON input and treat it as an array?

In any case: 'jq' implements a small virtual machine to execute your script. I'm not sure it would be the fastest tool for the job (or much faster than sed's or awk's implementation). It is certainly an "interpreted language", which you wrote above you are trying to avoid.

Similarly, regarding "I don't consider sed to be a good solution": you haven't yet told us what your actual need is, so we can't tell whether sed is good or bad for it...

> Since a C implementation should run the fastest, and cut from GNU's
> coreutils is written in C and presumably doesn't need much work to
> support this, it seems like the best solution. Even if this feature
> suggestion isn't approved by the GNU community, I will implement it for
> my own use anyway. I can enjoy the new functionality (which I think
> should have been added to cut long ago) and keep it to myself, or I can
> contribute it back to the online community. I could distribute it as my
> own fork of GNU coreutils or as a patch to it. However, if it were
> merged into GNU's coreutils, it would get the most exposure and be
> helpful to more people.

cut's implementation is optimized for cutting columns, not for reordering them. I think that if you try to add code to 'cut' that allows reordering of output fields, you'll discover that while it's very doable, it also significantly complicates the code.
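To make that last point concrete: today's cut treats the field list as a set - order and duplicates in the list are ignored, and output always follows input order. A small illustration (the sample line is made up):

```shell
# The field list "3,1,1" is normalized to the same set as "1,3",
# so both commands print identical output:
printf 'a\tb\tc\n' | cut -f3,1,1
printf 'a\tb\tc\n' | cut -f1,3
```

Supporting `-f3,1` as "field 3, then field 1" would therefore change both how the field list is parsed and how each output line is assembled - that is the complication mentioned above.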
A previous message on this thread stipulated that it takes extra effort to 'sort the columns' - that is incorrect for the current implementation.

Regardless, if you actually implement it - please do send the patch. To be considered for inclusion, it will need to be efficient (i.e. not make 'cut' slower than it is now), be correctly implemented for all sorts of edge cases, and have good tests covering the new functionality. Updating the manual pages and documentation is needed as well. You'll also need to assign the copyright of the patch to the FSF.

Good places to start:
  http://git.savannah.gnu.org/cgit/coreutils.git/tree/HACKING
  http://git.savannah.gnu.org/cgit/coreutils.git/tree/README-hacking

Here's an example of a patch that added a new feature to 'comm':
  https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=b50a151346c42816034b5c26266eb753b7dbe737
You can see the kind of changes that accompany a new feature.

Then again, given that the canonical way to reorder columns for many decades has been:

  awk '{print $9,$2,$6}'

and that this canonical way 'just works' on *any* existing POSIX system (think: every BSD, Solaris, AIX, and systems such as Alpine Linux, which uses BusyBox instead of GNU coreutils) - there is a very high barrier to adding such a non-standard feature.

regards,
 - assaf