Hi Assaf,

Thanks for the prompt reply.

You bring up good points about POSIX compliance and the availability of the datamash tool.
For the first point, I would note that most of coreutils goes well beyond POSIX. Consider "cp", which has many useful additions beyond the POSIX features.

The second point is about the availability of other tools to achieve a similar task. This is a judgment call about where this functionality belongs; there is no single right answer here. Such an implementation can be done with a few lines of code in any scripting language.

My main point is that, given that a very common use case for 'uniq' is in combination with other coreutils tools (sort, cut, sed), it makes sense to have an efficient implementation for "counting unique values" available within coreutils, instead of sending the user to look for a solution elsewhere or to implement their own.

Hope this makes sense.

Yair

On Sat, May 30, 2020 at 7:47 AM Assaf Gordon <[email protected]> wrote:
>
> Hello,
>
> On 2020-05-29 10:16 p.m., Yair Lenga wrote:
> > Wanted to suggest that the team will look (again) at implementing
> > --unsorted option for 'uniq'.
> >
> > The idea was proposed (and rejected) about 10 years ago
> > (https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html).
> > Lot of things have changed from the past.
> >
> [...]
> >
> > Can you advise/provide feedback. I'm sure that there will be many
> > volunteers (me included) to contribute to such important improvement.
>
> "uniq" is standardize by POSIX to work on "comparing adjacent lines"
> (from:
> https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html ) -
> hence the requirement to pre-sort the input.
>
> While it could be extended with a completely different hash-based
> implementation, I don't think this is likely to happen.
>
> As an alternative (and a shameless plug), allow me to point to
> GNU Datamash ( https://www.gnu.org/software/datamash/ ).
> On one hand, it already has a hash-based implementation to
> remove duplicated fields (called "rmdup").
> consider the following contrived example:
>
>   $ (printf "%s\t%s\n" 9 B 3 A ; seq 10 | paste - -) | datamash rmdup 1
>   9 B
>   3 A
>   1 2
>   5 6
>   7 8
>
> And on the other hand, because 'datamash' is non-standard,
> there's less of a problem in adding new functionality (i.e. "bloat" is
> not as big as a concern as it is for coreutils).
>
> Hope this helps.
>
> regards,
> - assaf
>
>
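
P.S. To make the "few lines of code" remark concrete, here is a minimal sketch of hash-based de-duplication and counting on unsorted input, using POSIX awk. The function names uniq_unsorted and count_unsorted are hypothetical, chosen only for illustration; they are not part of coreutils and are not a proposed interface for 'uniq'. The count column width is just one plausible layout.

```shell
# Hypothetical helper names; neither function is part of coreutils.

# Print each distinct line once, in order of first appearance.
# awk remembers every distinct line in an associative array,
# so the input does not need to be sorted first.
uniq_unsorted() {
    awk '!seen[$0]++' "$@"
}

# Count occurrences of each distinct line and list the counts
# in order of first appearance.
count_unsorted() {
    awk '
        { if (!($0 in count)) order[++n] = $0; count[$0]++ }
        END { for (i = 1; i <= n; i++) printf "%7d %s\n", count[order[i]], order[i] }
    ' "$@"
}
```

For example, `printf '%s\n' b a b c a b | uniq_unsorted` prints b, a, c, and piping the same input to count_unsorted reports 3 for b, 2 for a and 1 for c. The trade-off against `sort | uniq -c` is that memory use grows with the number of distinct lines.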
