[Cdk-user] questions about the hash fingerprinters

Andrew Dalke Mon, 20 Oct 2014 18:30:12 -0700

Hi all,

  I've been reviewing the history of hash-based fingerprinting systems, i.e., 
those derived from ideas first implemented by Daylight.


The primary documentation is at 
http://www.daylight.com/dayhtml/doc/theory/theory.finger.html - I've been 
unable to find any publication on it. The main detail I'm investigating is:

     Instead, each pattern serves as a seed to a pseudo-random number
     generator (it is "hashed"), the output of which is a set of bits
     (typically 4 or 5 bits per pattern); the set of bits thus produced
     is added (with a logical OR) to the fingerprint.

Daylight sets multiple bits to reduce the risk that two fragments will have 
identical signatures. Oddly, I've come across no research which describes how 
to choose if it's 4 or 5 or something else. RDKit, for example, uses 
nBitsPerHash=2. Indigo uses a number based on the size of the subgraph and its 
symmetry. Smaller and more symmetric subgraphs get fewer bits.

In talking to people who have used the Daylight toolkit, Daylight also used a 
variable number of bits, based on the size of the linear fragment.

I look at fingerprint/Fingerprinter.java to see what what CDK does, and I would 
like to confirm that I understand it, then make some observations. (The 
relevant code is also in HybridizationFingerprinter.java.)

The method getLimitedPathsOfLengthUpto() returns a list of {list of path atoms}.

For each {list of path atoms}:
  get the string of the atom and bond symbols, as A0+B01+A1+B12+A3+
  compute also the reverse, in order to determine a canonical ordering
  use Java's string hash function to get a hash value
  add that value to the return list

For each returned {hash value}:
  seed a new java.util.Random(hash)
  get the first integer from the seeded PRNG
  set that bit in the fingerprint

If I'm correct, then this is somewhat different than Daylight's hash method 
because it only uses one bit per hash, rather than "typically 4 or 5 bits per 
pattern".

1) What is the reason CDK decided to get only one random value and hence set 
only one bit?

2) Given that there is only one bit, what does the PRNG add to the 
calculations? The PRNG ends up mapping one value randomly to another value, but 
it doesn't change the bit distribution, other than introducing slightly more 
bit collisions. What effect does it have on the usefulness of the fingerprint 
vs. just using the hash value directly?

3) The implementation builds and discards an intermediate string representation 
for the fingerprint. I see that the first atom in the path has a special case 
for how to get the hash value:

                if (x instanceof IPseudoAtom)
                    sb.append('0');
                else {
                    Integer atnum = 
PeriodicTable.getAtomicNumber(x.getSymbol());
                    if (atnum > 0)
                        sb.append((char) atnum.intValue());
                    else
                        sb.append('0');
                }

but the remaining atoms have simply:

                    sb.append(convertSymbol(y[0].getSymbol()));

Is there something special about the first atom to require those special checks?

4) The string it builds looks something like "C=C-O", and the reverse is 
"O-C=C". The lexicographically first of these is used for the hash string. What 
happens if an atom has two characters, like "C-Al"? The reverse of this is 
"lA-C", so it chooses "C-Al". But if the fragment were "Al-C" then the reverse 
is "C-lA", and it would choose "C-lA".

It seems like there's a problem, since "C-Al" and "Al-C" should return the same 
hash. That's where convertSymbol() comes to the rescue. It normalizes "Al" to 
"A", as well as some of the other two letter elements. So this is converted to 
"C-A" vs. "A-C", and all is okay.

Except convertSymbol() is incomplete. It only normalizes Cl, Br, Si, As, Li, 
Se, Na, Ca, and Al. It doesn't support, say, Au or Pb. My reading of the code 
suggests that some rare molecules (including those with rare earths :) may have 
fingerprints which are determined by the input atom order, which shouldn't be 
the case for a fingerprint.

The best solution I can think of is to collect the bond symbols into their own 
list. Then use one scan through the atom symbols from each end, to determine 
the canonical best order. If there's a tie then do the same with the bond 
symbols list. One the direction has been determined, generate a hash from 
traversing the atom and bond symbols in the correct direction.

5) There's a cleanup step that I don't understand, 

        for (StringBuffer s : allPaths) {
            String s1 = s.toString().trim();
            if (s1.equals("")) continue;
            if (cleanPath.contains(s1)) continue;
            String s2 = s.reverse().toString().trim();
            if (cleanPath.contains(s2)) continue;
            cleanPath.add(s2);
        }

By construction, when will trim() ever do anything? I think only when the 
terminal atoms have a symbol of " " or other whitespace, which shouldn't ever 
happen, and if it were, I would think that "\t" and " " should mean different 
things.

The only way I can see to get the "" is if all of the atoms are whitespace, and 
all of the bonds (if there are any) are quadruple.

(BTW, there may need to be a QUADRUPLE case for the getBondSymbol() 
implementation.)

What I don't understand though is the additional reverse(). The canonical 
string was already selected, so what does s2 add?

6) Finally, I think the cleanup code is is better merged with the main fragment 
enumeration loop.

As it stands, if there are 100 identical fragments (e.g., "C.C.C. ... .C") then 
it builds up a list of 100 StringBuffer instances, then converts them to a 
Set<String> with one instance, then converts that to a int[] of hash values.

If the Set<String> is moved up, and the unique test done for each canonical 
fragment string as soon as it's created, then there would be no need for an 
intermediate StringBuffer list.

Or, since the downstream code will (re)set the already set bits if it sees the 
same hash code twice, then there's no need to distinguish between two strings 
with the same hash code, and this could be replaced with a Set<int> containing 
the hash codes.

Best regards,
  
                                Andrew
                                da...@dalkescientific.com



------------------------------------------------------------------------------
Comprehensive Server Monitoring with Site24x7.
Monitor 10 servers for $9/Month.
Get alerted through email, SMS, voice calls or mobile push notifications.
Take corrective actions from your mobile device.
http://p.sf.net/sfu/Zoho
_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user

[Cdk-user] questions about the hash fingerprinters

Reply via email to