On 07/10/2018 08:21 PM, Assaf Gordon wrote:
> I would suggest several improvements to the measurements, to ensure
> relevant information is conveyed:
>
> 1.
> [...]
great summary. At the risk of repeating points already made:

7. Consider all the existing options, i.e. processing modes, of 'uniq'
as well.  This means it has to be decided (upfront!) whether
introducing an alternate way to do things - in this case hashing -
only serves the trivial case, or whether it would slow down or even
contradict the processing with the other options, currently:

  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines, one for each group
  -D                    print all duplicate lines
      --all-repeated[=METHOD]  like -D, but allow separating groups
                          with an empty line;
                          METHOD={none(default),prepend,separate}
  -f, --skip-fields=N   avoid comparing the first N fields
      --group[=METHOD]  show all items, separating groups with an
                          empty line;
                          METHOD={separate(default),prepend,append,both}
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated line delimiter is NUL, not newline
  -w, --check-chars=N   compare no more than N characters in lines

Without deeper thinking about it, especially the combinations with the
--group, --repeated and --unique options might be problematic.
(Toy sketches of both processing models are appended in the P.S. below.)

8. Standards.

POSIX says:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html

  The uniq utility shall read an input file comparing adjacent lines,
  and write one copy of each input line on the output. The second and
  succeeding copies of repeated adjacent input lines shall not be
  written.

This makes an assumption about both the input and the output.

Regarding the input, it means that the processing only has to remember
the previous line in order to decide whether to print or discard the
current one in the output.  Therefore, arbitrary input size is fine.
Hashing will most probably have issues with arbitrary input size; I am
not talking about 1 GB files, but _much_ larger ones (yes, we are in
2018 where 1 GB is like nothing).  See the P.S. below.

Regarding the output, it means that the output implicitly comes out in
the same (typically sorted) order as the input.  Like with the input,
uniq can discard the already-written data from memory because it is
sure that it won't need to consider it anymore.

That said, a successful optimization idea not only has to show that it
is faster or needs fewer resources in _some_ cases, it must also prove
that it does not tear down many other cases, including extreme ones.
The current implementation might be as it is for a good reason.  If it
turns out that the optimization idea breaks even a single one of the
above use cases, then the dilemma is whether keeping 2 implementations
is warranted (maintenance!), and whether it is possible to detect the
extreme cases early enough to switch from the default to the other
processing strategy.

https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good

or D. Knuth: "premature optimization is the root of all evil (or at
least most of it) in programming"

As Assaf said, a patch with a proof of concept would be helpful ...
even if it just helps to prove that the current implementation is fine.

Have a nice day,
Berny
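
P.S.: To make point 8 a bit more concrete, here is a toy sketch of the
processing model POSIX describes.  It is *not* the real uniq.c (which
handles all the options above), just an illustration that only the
previous line has to be kept in memory, so the input may be arbitrarily
large:

#define _GNU_SOURCE        /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main (void)
{
  char *prev = NULL, *line = NULL;
  size_t prev_size = 0, line_size = 0;

  while (getline (&line, &line_size, stdin) != -1)
    {
      /* Print the line only if it differs from its predecessor.  */
      if (prev == NULL || strcmp (prev, line) != 0)
        fputs (line, stdout);

      /* Keep only this one line; everything older is forgotten,
         which is why arbitrary input size is fine.  */
      char *tmp = prev;  prev = line;  line = tmp;
      size_t tmp_size = prev_size;  prev_size = line_size;
      line_size = tmp_size;
    }

  free (prev);
  free (line);
  return 0;
}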
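
And here is an equally simplistic sketch of the hashing idea,
hand-rolled only to keep it self-contained (an actual patch would
presumably use gnulib's hash module instead).  Every distinct line seen
so far has to stay in memory until EOF - that is the concern about
arbitrary input size - and since it only answers "was this line seen
before?", options like -c, -d/-D, -u or --group, which need counts and
boundaries of groups of *adjacent* lines, would need extra bookkeeping
on top.  Note also that, unlike POSIX uniq, it discards duplicates
globally, not only adjacent ones:

#define _GNU_SOURCE        /* for getline() and strdup() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536

struct entry { char *line; struct entry *next; };
static struct entry *buckets[NBUCKETS];

/* FNV-1a, just to keep the sketch self-contained.  */
static size_t
hash_line (char const *s)
{
  size_t h = 2166136261u;
  for (; *s; s++)
    h = (h ^ (unsigned char) *s) * 16777619u;
  return h % NBUCKETS;
}

/* Return 1 if LINE was seen before; otherwise remember it and return 0.
   Every distinct line is kept until EOF - this is the memory issue.  */
static int
seen_before (char const *line)
{
  size_t h = hash_line (line);
  for (struct entry *e = buckets[h]; e; e = e->next)
    if (strcmp (e->line, line) == 0)
      return 1;

  struct entry *e = malloc (sizeof *e);
  if (!e || !(e->line = strdup (line)))
    exit (EXIT_FAILURE);
  e->next = buckets[h];
  buckets[h] = e;
  return 0;
}

int
main (void)
{
  char *line = NULL;
  size_t n = 0;

  while (getline (&line, &n, stdin) != -1)
    if (!seen_before (line))   /* print each distinct line only once */
      fputs (line, stdout);

  free (line);
  return 0;
}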