Re: uniq without the need of sort

Bob Proulx Tue, 08 Nov 2011 09:53:03 -0800

Peng Yu wrote:
> 'uniq' currently relies on 'sort'.

But uniq doesn't rely upon sort.  There is no requirement that the
input file be sorted.  That is up to the caller.  The uniq program
only compares adjacent lines.  It has a very short memory.  In effect
it is collapsing adjacent identical lines.


  $ printf "A\nA\nB\nA\nA\n" | uniq
  A
  B
  A

> When the input file is small, this is OK. But when the input file is
> large, this seems to be a waste (the complexity is O(n log(n)),

But uniq does not keep track of this information currently.  Adding it
would be a large increase in capability to this simple program.  (The
V7 uniq.c is only 124 lines of code in total.)

It isn't a waste currently since it isn't ever done in uniq at all and
it is only done in sort if the author decides it is needed.  And once
done in sort it isn't repeated in uniq.  It is done in one place only
(sort) and only if the author asks for it to be done (by calling
sort).

> if uniq handles a hash table its self the complexity is only O(n)).

This is the point where the analysis is based upon an incorrect
assumption.  Since uniq does not keep a hash of input it does not
expend that energy.  It is only comparing adjacent lines.  That could
be a single line cache.

If you are proposing that code be added to uniq to keep a table of all
input lines then that is a large feature proposal relative to the
current features of the program.

> I'm wondering if it is better to relax the requirement of 'sort'
> when 'uniq' is used.

I think you are asking for an option to uniq that would optionally
sort the input.

I think that those operations are already fully supported by having
sort be available for use.  If you need the data to be sorted then you
sort it.  If you don't need it sorted then you don't.  I think that it
would be undesirable feature creep into the program to have sort
either keep track of all input lines or to sort them.  Also running
sort and uniq in different processes can make better use of
multi-processor machines.

On the other side of the debate, for large files moving a lot of data
through pipes can take a significant amount of time.  In a performance
bound application on a single processor machine removing the character
I/O through pipes can speed things up.  (But if an application is
performance bound then perhaps doing it in a dedicated application
instead of the shell is best.)

Bob

Re: uniq without the need of sort

Reply via email to