> IMO much of the time overhead is basically unavoidable, as you need to
> do lots of random probes to a huge table (order of million entries,
> e.g. 1.5M in my case). Putting a single Bloom filter in front would
> shave I/O of most *failed* probes, but otherwise we can't get much
> better, I believe (being a PhD student in data structures).
There is some overhead, but it doesn't have to be anywhere near what we
have right now. The filesystem stores a lot more information than
necessary. Making the on-disk structure smaller and more domain-specific
makes the operation noticeably faster, and I can't think of a smaller
structure for storing hash maps than patricia trees. A simple
hash-to-block mapping is all we need.

I'm not sure whether the Bloom filter would really help. Since it is
impossible to predict the required size, you would have to use some form
of growing Bloom filter, or proactively create a filter so large that it
might defeat its own purpose.

> Anyway, this deduplication work seems better delegated to some other
> SW/library. Surely there are some already. If we wanted to deduplicate
> *similar* (unequal) files, we would even need to get FS-dependent. In
> the typical ext4 case, we don't even have reflinks AFAIK...

That's what I'm saying. A separate tool should handle this. It would
depend on the FS (in the sense of being a function of the FS), but not
be bound to any particular one. For traditional FSes like ext4 it would
do hard-linking, but perhaps in a slightly smarter way (a rough sketch
follows below).

Greets,
Ertugrul
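
P.S. To make the hard-linking idea a bit more concrete, here is a rough,
untested Python sketch of such a tool: walk a store directory, hash the
regular files, and replace later duplicates with hard links to the first
copy seen. The paths, hash choice and (lack of) error handling are purely
illustrative, not what a real tool would have to look like.

    #!/usr/bin/env python3
    # Sketch only: dedup identical files under a directory via hard links.
    import hashlib
    import os
    import sys

    def file_hash(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    def dedup(root):
        seen = {}  # content hash -> canonical path
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path) or os.path.islink(path):
                    continue
                digest = file_hash(path)
                canonical = seen.get(digest)
                if canonical is None:
                    seen[digest] = path
                elif os.stat(canonical).st_ino != os.stat(path).st_ino:
                    # Same content, different inode: replace with a hard
                    # link. Note this ignores differing permissions and
                    # mtimes; a smarter tool would check those first.
                    tmp = path + ".dedup-tmp"
                    os.link(canonical, tmp)
                    os.replace(tmp, path)

    if __name__ == "__main__":
        dedup(sys.argv[1])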