Hi all, I've been reviewing the history of hash-based fingerprinting systems, i.e., those derived from ideas first implemented by Daylight.
The primary documentation is at http://www.daylight.com/dayhtml/doc/theory/theory.finger.html - I've been unable to find any publication on it. The main detail I'm investigating is: Instead, each pattern serves as a seed to a pseudo-random number generator (it is "hashed"), the output of which is a set of bits (typically 4 or 5 bits per pattern); the set of bits thus produced is added (with a logical OR) to the fingerprint. Daylight sets multiple bits to reduce the risk that two fragments will have identical signatures. Oddly, I've come across no research which describes how to choose if it's 4 or 5 or something else. RDKit, for example, uses nBitsPerHash=2. Indigo uses a number based on the size of the subgraph and its symmetry. Smaller and more symmetric subgraphs get fewer bits. In talking to people who have used the Daylight toolkit, Daylight also used a variable number of bits, based on the size of the linear fragment. I look at fingerprint/Fingerprinter.java to see what what CDK does, and I would like to confirm that I understand it, then make some observations. (The relevant code is also in HybridizationFingerprinter.java.) The method getLimitedPathsOfLengthUpto() returns a list of {list of path atoms}. For each {list of path atoms}: get the string of the atom and bond symbols, as A0+B01+A1+B12+A3+ compute also the reverse, in order to determine a canonical ordering use Java's string hash function to get a hash value add that value to the return list For each returned {hash value}: seed a new java.util.Random(hash) get the first integer from the seeded PRNG set that bit in the fingerprint If I'm correct, then this is somewhat different than Daylight's hash method because it only uses one bit per hash, rather than "typically 4 or 5 bits per pattern". 1) What is the reason CDK decided to get only one random value and hence set only one bit? 2) Given that there is only one bit, what does the PRNG add to the calculations? The PRNG ends up mapping one value randomly to another value, but it doesn't change the bit distribution, other than introducing slightly more bit collisions. What effect does it have on the usefulness of the fingerprint vs. just using the hash value directly? 3) The implementation builds and discards an intermediate string representation for the fingerprint. I see that the first atom in the path has a special case for how to get the hash value: if (x instanceof IPseudoAtom) sb.append('0'); else { Integer atnum = PeriodicTable.getAtomicNumber(x.getSymbol()); if (atnum > 0) sb.append((char) atnum.intValue()); else sb.append('0'); } but the remaining atoms have simply: sb.append(convertSymbol(y[0].getSymbol())); Is there something special about the first atom to require those special checks? 4) The string it builds looks something like "C=C-O", and the reverse is "O-C=C". The lexicographically first of these is used for the hash string. What happens if an atom has two characters, like "C-Al"? The reverse of this is "lA-C", so it chooses "C-Al". But if the fragment were "Al-C" then the reverse is "C-lA", and it would choose "C-lA". It seems like there's a problem, since "C-Al" and "Al-C" should return the same hash. That's where convertSymbol() comes to the rescue. It normalizes "Al" to "A", as well as some of the other two letter elements. So this is converted to "C-A" vs. "A-C", and all is okay. Except convertSymbol() is incomplete. It only normalizes Cl, Br, Si, As, Li, Se, Na, Ca, and Al. It doesn't support, say, Au or Pb. My reading of the code suggests that some rare molecules (including those with rare earths :) may have fingerprints which are determined by the input atom order, which shouldn't be the case for a fingerprint. The best solution I can think of is to collect the bond symbols into their own list. Then use one scan through the atom symbols from each end, to determine the canonical best order. If there's a tie then do the same with the bond symbols list. One the direction has been determined, generate a hash from traversing the atom and bond symbols in the correct direction. 5) There's a cleanup step that I don't understand, for (StringBuffer s : allPaths) { String s1 = s.toString().trim(); if (s1.equals("")) continue; if (cleanPath.contains(s1)) continue; String s2 = s.reverse().toString().trim(); if (cleanPath.contains(s2)) continue; cleanPath.add(s2); } By construction, when will trim() ever do anything? I think only when the terminal atoms have a symbol of " " or other whitespace, which shouldn't ever happen, and if it were, I would think that "\t" and " " should mean different things. The only way I can see to get the "" is if all of the atoms are whitespace, and all of the bonds (if there are any) are quadruple. (BTW, there may need to be a QUADRUPLE case for the getBondSymbol() implementation.) What I don't understand though is the additional reverse(). The canonical string was already selected, so what does s2 add? 6) Finally, I think the cleanup code is is better merged with the main fragment enumeration loop. As it stands, if there are 100 identical fragments (e.g., "C.C.C. ... .C") then it builds up a list of 100 StringBuffer instances, then converts them to a Set<String> with one instance, then converts that to a int[] of hash values. If the Set<String> is moved up, and the unique test done for each canonical fragment string as soon as it's created, then there would be no need for an intermediate StringBuffer list. Or, since the downstream code will (re)set the already set bits if it sees the same hash code twice, then there's no need to distinguish between two strings with the same hash code, and this could be replaced with a Set<int> containing the hash codes. Best regards, Andrew da...@dalkescientific.com ------------------------------------------------------------------------------ Comprehensive Server Monitoring with Site24x7. Monitor 10 servers for $9/Month. Get alerted through email, SMS, voice calls or mobile push notifications. Take corrective actions from your mobile device. http://p.sf.net/sfu/Zoho _______________________________________________ Cdk-user mailing list Cdk-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/cdk-user