John Darrington <[email protected]> writes: > One small issue I've just noticed: > > hash = hsh_hash_string (tocode) ^ hsh_hash_string (fromcode); > > Since the ^ operator is commutative, won't this create the same hash > for complementary convertors? It's going to be very common to have, > say, a utf8-to-latin1 convertor and a latin1-to-utf8 convertor > concurrently.
Good point. I've been meaning to replace the PSPP hash functions for a while now. The FNV hash is not so great, and our implementations lack a "basis" or "initval" argument that can be used to combine hashes in a high-quality way (e.g. not XOR of their results). So I've pushed a branch for review that fixes these problems. It's named "hash", and here is the summary: commit b4e3275011982e29b80589bef705fc8a0a0316dd Author: Ben Pfaff <[email protected]> Date: Wed Apr 8 21:39:22 2009 -0700 NPAR TESTS: Consistently order variables in summary statistics. The set of variables in the NPAR TESTS specs structure was ordered randomly, according to however the hash function happened to arrange them. Sort them by variable name, instead, so that they always appear in alphabetical order in, e.g., descriptive statistics output. The particular hash function PSPP uses now tends to order variables alphabetically anyhow. The next commit changes the PSPP hash functions, so fixing this in advance prevents having to update any test output. src/language/stats/npar.q | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) commit e9c717e43278364a49b68db4718cab5c9229c8fb Author: Ben Pfaff <[email protected]> Date: Wed Apr 8 21:55:31 2009 -0700 Use Bob Jenkins lookup3 hash instead of FNV. The Jenkins lookup3 hash is superior to FNV in collision resistance, avalanching, and performance on systems that do not have fast multiplication. It also provides a good way to combine the result of a previous hashing step with the current hash, using its "basis" argument. This commit replaces the PSPP implementation of FNV with the Jenkins lookup3 hash and updates all the current users. In addition, John Darrington pointed out that commit dd2e61b4a "Make create_iconv() properly distinguish converters by name" unintentionally introduced gratuitous hash collisions, by causing all converters where tocode and fromcode were the same to hash to value 0, and converters where tocode and fromcode were swapped to hash to the same value as each other. Using the "basis" argument to the Jenkins hash properly, instead of just attempting to combine hash values with XOR, fixes this problem. src/data/attributes.c | 6 +- src/data/file-handle-def.c | 12 ++- src/data/file-name.c | 6 +- src/data/short-names.c | 8 +- src/data/value-labels.c | 8 +- src/data/value.c | 6 +- src/data/variable.c | 4 +- src/language/stats/autorecode.c | 4 +- src/language/stats/crosstabs.q | 2 +- src/libpspp/hash-functions.c | 196 ++++++++++++++++++++++++++++----------- src/libpspp/hash-functions.h | 12 +- src/libpspp/i18n.c | 2 +- src/math/covariance-matrix.c | 11 +- 13 files changed, 180 insertions(+), 97 deletions(-) -- "Premature optimization is the root of all evil." --D. E. Knuth, "Structured Programming with go to Statements" _______________________________________________ pspp-dev mailing list [email protected] http://lists.gnu.org/mailman/listinfo/pspp-dev
