On Friday, 12 February 2021 at 07:23:12 UTC, frame wrote:
On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:

This turns the OP's O(n log n) algorithm into an O(n) algorithm, doesn't need to copy the entire content of the file into memory, and also uses
much less memory by storing only hashes.

But this kind of hash may be insufficient to avoid hash collisions. For such big data, slower but stronger algorithms like SHA are advisable.
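
For example, a rough, untested sketch of that (file name and the set-style AA are just for illustration) which keeps only the SHA-256 digest of every line it has already seen:

import std.stdio;
import std.digest.sha : sha256Of;

void main(string[] args)
{
    bool[ubyte[32]] seen;               // set of SHA-256 digests of seen lines

    auto f = File(args.length > 1 ? args[1] : "input.txt");
    foreach (line; f.byLine)
    {
        auto digest = sha256Of(line);   // 32-byte strong digest
        if (digest !in seen)
        {
            seen[digest] = true;
            writeln(line);              // first occurrence: keep the line
        }
    }
}

Each distinct line then costs 32 bytes plus AA overhead, independent of how long the line is.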

Also, associative arrays use the same weak hash algorithm internally, so you can run into collision issues there too. Thus, using the hash of string data as the key can be a problem. I always use a quick hash as the key but actually hold a collection of stronger hashes behind it and do a lookup there to be on the safe side.
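
Roughly what I mean, as an untested sketch (`seenBefore` is just a made-up name here, and `hashOf` stands in for the quick hash the AA would use anyway):

import std.digest.sha : sha256Of;
import std.algorithm.searching : canFind;

// key: quick hash, value: SHA-256 digests of all lines seen under that key
bool seenBefore(ref ubyte[32][][size_t] buckets, const(char)[] line)
{
    size_t quick = hashOf(line);        // weak hash, same family the AA uses
    ubyte[32] strong = sha256Of(line);  // strong hash for the real decision

    if (auto p = quick in buckets)
    {
        if ((*p).canFind(strong))
            return true;                // same strong hash: really a duplicate
        *p ~= strong;                   // only the weak hash collided
        return false;
    }
    buckets[quick] = [strong];
    return false;
}

Called once per line while streaming over the file, so the full line text never has to be kept around.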

Forgot to mention that this kind of solution needs a better approach if you don't want to miss a potentially different line:

You can use a weak hash but track the line positions and count how often the same hash occurs in a pre-processing pass. In a post-processing pass you look at those lines again and check whether they are really identical or just hash collisions that need to be corrected.
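
Something along these lines, again only an untested sketch (the actual comparison of the candidate lines is left out, and it assumes plain LF-terminated lines so the byte offsets add up):

import std.stdio;
import std.typecons : Yes;

void main(string[] args)
{
    auto path = args.length > 1 ? args[1] : "input.txt";
    ulong[][size_t] offsetsByHash;   // weak hash -> file offsets of its lines

    // Pass 1: store only the weak hash and the offset of every line;
    // the number of offsets per hash is the occurrence count.
    {
        auto f = File(path);
        ulong offset = 0;
        foreach (line; f.byLine(Yes.keepTerminator))
        {
            offsetsByHash[hashOf(line)] ~= offset;
            offset += line.length;
        }
    }

    // Pass 2: only hashes that occurred more than once can hide a collision,
    // so re-read those lines and compare the real content.
    auto f = File(path);
    foreach (offs; offsetsByHash)
    {
        if (offs.length < 2)
            continue;                    // hash occurred once: line is unique
        string[] candidates;
        foreach (off; offs)
        {
            f.seek(cast(long) off);
            candidates ~= f.readln();    // re-read the candidate line
        }
        // entries in `candidates` with equal text are real duplicates,
        // the rest only collided on the weak hash
    }
}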
