On 07/10/2018 08:21 PM, Assaf Gordon wrote:
> I would suggest several improvements to the measurements, to ensure
> relevant information is conveyed:
>
> 1.
> [...]
great summary. At the risk of repeating points already made:

7. Consider all the existing options, i.e. processing modes, of 'uniq'
as well.  This means it has to be decided (upfront!) whether
introducing an alternate way to do things - in this case hashing -
only serves the trivial case, or whether it would slow down or even
contradict the processing with the other options, currently:

  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines, one for each group
  -D                    print all duplicate lines
      --all-repeated[=METHOD]  like -D, but allow separating groups
                          with an empty line;
                          METHOD={none(default),prepend,separate}
  -f, --skip-fields=N   avoid comparing the first N fields
      --group[=METHOD]  show all items, separating groups with an
                          empty line;
                          METHOD={separate(default),prepend,append,both}
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated line delimiter is NUL, not newline
  -w, --check-chars=N   compare no more than N characters in lines

Without deeper thinking about it, especially the combinations with the
--group, --repeated and --unique options might be problematic.
(Toy sketches of both processing models are appended in the P.S. below.)

8. Standards.

POSIX says:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html

  The uniq utility shall read an input file comparing adjacent lines,
  and write one copy of each input line on the output. The second and
  succeeding copies of repeated adjacent input lines shall not be
  written.

This makes an assumption about both the input and the output.

Regarding the input, it means that the processing only has to remember
the previous line in order to decide whether to print or discard the
current one in the output.  Therefore, arbitrary input size is fine.
Hashing will most probably have issues with arbitrary input size; I am
not talking about 1 GB files, but _much_ larger ones (yes, we are in
2018 where 1 GB is like nothing).  See the P.S. below.

Regarding the output, it means that the output implicitly comes out in
the same (typically sorted) order as the input.  Like with the input,
uniq can discard the already-written data from memory because it is
sure that it won't need to consider it anymore.

That said, a successful optimization idea not only has to show that it
is faster or needs fewer resources in _some_ cases, it must also prove
that it does not tear down many other cases, including extreme ones.
The current implementation might be as it is for a good reason.  If it
turns out that the optimization idea breaks even a single one of the
above use cases, then the dilemma is whether keeping 2 implementations
is warranted (maintenance!), and whether it is possible to detect the
extreme cases early enough to switch from the default to the other
processing strategy.

https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good

or D. Knuth: "premature optimization is the root of all evil (or at
least most of it) in programming"

As Assaf said, a patch with a proof of concept would be helpful ...
even if it just helps to prove that the current implementation is fine.

Have a nice day,
Berny
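
P.S.: To make point 8 a bit more concrete, here is a toy sketch of the
processing model POSIX describes.  It is *not* the real uniq.c (which
handles all the options above), just an illustration that only the
previous line has to be kept in memory, so the input may be arbitrarily
large:

#define _GNU_SOURCE        /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main (void)
{
  char *prev = NULL, *line = NULL;
  size_t prev_size = 0, line_size = 0;

  while (getline (&line, &line_size, stdin) != -1)
    {
      /* Print the line only if it differs from its predecessor.  */
      if (prev == NULL || strcmp (prev, line) != 0)
        fputs (line, stdout);

      /* Keep only this one line; everything older is forgotten,
         which is why arbitrary input size is fine.  */
      char *tmp = prev;  prev = line;  line = tmp;
      size_t tmp_size = prev_size;  prev_size = line_size;
      line_size = tmp_size;
    }

  free (prev);
  free (line);
  return 0;
}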
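
And here is an equally simplistic sketch of the hashing idea,
hand-rolled only to keep it self-contained (an actual patch would
presumably use gnulib's hash module instead).  Every distinct line seen
so far has to stay in memory until EOF - that is the concern about
arbitrary input size - and since it only answers "was this line seen
before?", options like -c, -d/-D, -u or --group, which need counts and
boundaries of groups of *adjacent* lines, would need extra bookkeeping
on top.  Note also that, unlike POSIX uniq, it discards duplicates
globally, not only adjacent ones:

#define _GNU_SOURCE        /* for getline() and strdup() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536

struct entry { char *line; struct entry *next; };
static struct entry *buckets[NBUCKETS];

/* FNV-1a, just to keep the sketch self-contained.  */
static size_t
hash_line (char const *s)
{
  size_t h = 2166136261u;
  for (; *s; s++)
    h = (h ^ (unsigned char) *s) * 16777619u;
  return h % NBUCKETS;
}

/* Return 1 if LINE was seen before; otherwise remember it and return 0.
   Every distinct line is kept until EOF - this is the memory issue.  */
static int
seen_before (char const *line)
{
  size_t h = hash_line (line);
  for (struct entry *e = buckets[h]; e; e = e->next)
    if (strcmp (e->line, line) == 0)
      return 1;

  struct entry *e = malloc (sizeof *e);
  if (!e || !(e->line = strdup (line)))
    exit (EXIT_FAILURE);
  e->next = buckets[h];
  buckets[h] = e;
  return 0;
}

int
main (void)
{
  char *line = NULL;
  size_t n = 0;

  while (getline (&line, &n, stdin) != -1)
    if (!seen_before (line))   /* print each distinct line only once */
      fputs (line, stdout);

  free (line);
  return 0;
}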