On Friday, 12 February 2021 at 07:23:12 UTC, frame wrote:
On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:
This turns the OP's O(n log n) algorithm into an O(n) algorithm, doesn't need to copy the entire content of the file into memory, and also uses much less memory by storing only hashes.
But this kind of hash may be insufficient to avoid hash collisions. For data that big, slower but stronger algorithms like SHA are advisable.
Also, associative arrays use the same weak algorithm, where you can run into collision issues. So using the hash of the string data as the key can be a problem. I always use a quick hash as the key but actually hold a collection of stronger hashes in the value and do a lookup in it to be on the safe side.
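A minimal sketch of that scheme in D, assuming a hypothetical input file "data.txt" and SHA-256 as the stronger digest: the AA is keyed by the built-in quick `hashOf`, and each bucket holds the SHA-256 digests of the lines that mapped to that key, so two different lines that collide on the quick hash are still told apart.

```d
import std.algorithm.searching : canFind;
import std.digest.sha : sha256Of;
import std.stdio : File, writeln;

void main()
{
    // Hypothetical input file; substitute the real one.
    auto file = File("data.txt", "r");

    // Quick hash as AA key; each bucket keeps the SHA-256 digests of the
    // lines that landed there, so quick-hash collisions are still resolved.
    ubyte[32][][size_t] buckets;

    size_t unique;
    foreach (line; file.byLine)        // byLine reuses its buffer; we only hash it
    {
        auto fast   = hashOf(line);    // built-in quick hash
        auto strong = sha256Of(line);  // stronger digest for verification

        auto bucket = fast in buckets;
        if (bucket is null || !(*bucket).canFind(strong))
        {
            buckets[fast] ~= strong;   // new digest under this quick hash
            ++unique;
        }
    }
    writeln(unique, " unique lines");
}
```

Per line only the size_t quick hash and a 32-byte digest are kept, so memory stays far below holding the lines themselves.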
Forgot to mention that this kind of solution needs a better approach if you don't want to miss a potentially different line: you can use a weak hash, but track the line positions and count how often each hash occurs as a pre-process. In the post-process you look at those lines again and compare whether they are really identical or just hash collisions, and correct accordingly.
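A sketch of that two-pass idea in D, again with a hypothetical "data.txt" and simplified to re-read the file instead of seeking to recorded positions: the pre-process only counts how often each quick hash occurs; the post-process keeps the actual text of lines whose hash occurred more than once, so real duplicates and mere collisions can be distinguished.

```d
import std.stdio : File, writeln;

void main()
{
    enum fileName = "data.txt";        // hypothetical input file

    // Pre-process: count how often each quick hash occurs.
    size_t[size_t] counts;             // quick hash -> number of occurrences
    foreach (line; File(fileName).byLine)
        ++counts[hashOf(line)];

    // Post-process: a hash seen only once belongs to a unique line for sure.
    // Lines sharing a hash may be duplicates or collisions, so keep their
    // actual text and compare that instead.
    size_t unique;
    bool[string] suspects;
    foreach (line; File(fileName).byLine)
    {
        if (counts[hashOf(line)] == 1)
        {
            ++unique;
        }
        else
        {
            auto key = line.idup;      // copy: byLine reuses its buffer
            if (key !in suspects)
            {
                suspects[key] = true;
                ++unique;
            }
        }
    }
    writeln(unique, " unique lines");
}
```

Memory here only grows with the lines whose hash actually repeats, so an occasional collision costs one stored line rather than pulling the whole file into memory.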