On Tue, Nov 30, 2021 at 9:12 PM Joe Sapp <sa...@ieee.org> wrote:
>
> If you're trying to get around the maximum argument limit, maybe this will
> help:
> https://www.gnu.org/software/parallel/man.html#example-inserting-multiple-arguments
> Or this:
> https://www.gnu.org/software/parallel/man.html#example-processing-a-big-file-using-more-cpus
>
> Parallel can break up the calls to a command, limiting the number of
> arguments to the maximum allowed on the command line. But then you
> won't have one sorted file in the end. Try the examples, use "-m -j1"
> or "-X -j1", and do a final `sort -u` on the output file.
I'm still not sure whether these tricks can deal with the following problem, as analyzed by Janis Papanagnou [1]:

The issue stems from the fact of a limited exec-buffer size and that [shell-external] commands will operate on that limited buffer. Whenever your sample size - actually the argument list size - will exceed that limit the outcome is unreliable and depends on the data used; it may work in 10 cases and fail in 100, or vice versa, it may work for all your application cases (because you are operating only on toy data), or it may always fail (because you are working with huge amounts of scientific data), or anything else.

To understand the issue it suffices to assume small values, say a buffer-size of 15 and a few short arguments. Say you have the file arguments A B C D ... Z and want to sort them. Say in the buffer there's room for only 5, so that sorting with above 'find'-based constructs will result in many calls;

  sort A B C D E
  sort F G H I J
  ...
  sort Z

and the output will be the concatenation of the individual calls. A..E will be sorted, F..J will be sorted, etc. but A..Z will not be sorted after the concatenation of the individual sorted parts.

Very subtle errors can occur this way if one is not aware of that fact; the result may look correct if one looks at the first few MB of the result, but may actually be wrong.

Whether other tools (like the one mentioned below) circumvent the exec-buffer issue must be checked - but I wouldn't expect it does. What a tool would need to do is either the ability to see all data in one call, or to create partly sorted data and make more sort runs on that partly sorted data; merge-sort is an algorithm that works that way (which had been used on sequentially operating tape archives especially in former times).

[1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/pGYxDLvRBAAJ
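
To make the analysis above concrete, here is a small Python simulation of the effect. It is only a sketch under stated assumptions: each letter stands in for one line of data, the batch size of 5 plays the role of the limited exec buffer, and the function names `chunked_sort` and `chunked_merge_sort` are made up for illustration. It does not call find, xargs, or parallel.

```python
from heapq import merge


def chunked_sort(items, chunk_size):
    """Sort each batch on its own and concatenate the outputs.

    This mimics what happens when the argument list is split into
    several separate 'sort' invocations (hypothetical model).
    """
    out = []
    for i in range(0, len(items), chunk_size):
        out.extend(sorted(items[i:i + chunk_size]))
    return out


def chunked_merge_sort(items, chunk_size):
    """Sort each batch on its own, then merge the sorted runs.

    This is the merge-sort idea: a further pass over the
    partly sorted data produces one globally sorted result.
    """
    runs = [sorted(items[i:i + chunk_size])
            for i in range(0, len(items), chunk_size)]
    return list(merge(*runs))


# 26 unordered "lines" of data; a batch holds only 5 of them.
data = list("QWERTYUIOPASDFGHJKLZXCVBNM")

naive = chunked_sort(data, 5)
merged = chunked_merge_sort(data, 5)

print("concatenated batches:", "".join(naive))
print("merged runs:         ", "".join(merged))
print("naive is sorted: ", naive == sorted(data))
print("merged is sorted:", merged == sorted(data))
```

Each batch in the concatenated output is sorted internally, but the whole is not. Merging the sorted runs fixes that. On the command line, `sort -m` performs this kind of merge on files that are already sorted, so writing each batch to its own file and merging afterwards may be one way to get a single sorted result.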