Peng Yu wrote: > 'uniq' currently relies on 'sort'. But uniq doesn't rely upon sort. There is no requirement that the input file be sorted. That is up to the caller. The uniq program only compares adjacent lines. It has a very short memory. In effect it is collapsing adjacent identical lines.
$ printf "A\nA\nB\nA\nA\n" | uniq A B A > When the input file is small, this is OK. But when the input file is > large, this seems to be a waste (the complexity is O(n log(n)), But uniq does not keep track of this information currently. Adding it would be a large increase in capability to this simple program. (The V7 uniq.c is only 124 lines of code in total.) It isn't a waste currently since it isn't ever done in uniq at all and it is only done in sort if the author decides it is needed. And once done in sort it isn't repeated in uniq. It is done in one place only (sort) and only if the author asks for it to be done (by calling sort). > if uniq handles a hash table its self the complexity is only O(n)). This is the point where the analysis is based upon an incorrect assumption. Since uniq does not keep a hash of input it does not expend that energy. It is only comparing adjacent lines. That could be a single line cache. If you are proposing that code be added to uniq to keep a table of all input lines then that is a large feature proposal relative to the current features of the program. > I'm wondering if it is better to relax the requirement of 'sort' > when 'uniq' is used. I think you are asking for an option to uniq that would optionally sort the input. I think that those operations are already fully supported by having sort be available for use. If you need the data to be sorted then you sort it. If you don't need it sorted then you don't. I think that it would be undesirable feature creep into the program to have sort either keep track of all input lines or to sort them. Also running sort and uniq in different processes can make better use of multi-processor machines. On the other side of the debate, for large files moving a lot of data through pipes can take a significant amount of time. In a performance bound application on a single processor machine removing the character I/O through pipes can speed things up. (But if an application is performance bound then perhaps doing it in a dedicated application instead of the shell is best.) Bob
