> Begin forwarded message:
> 
> Subject: Re: [Cdk-user] questions about the hash fingerprinters
> From: John May <j...@nextmovesoftware.com>
> Date: 28 November 2014 09:55:56 GMT
> To: Andrew Dalke <da...@dalkescientific.com>
> 
> Hi Andrew,
> 
> Firstly it should be noted the fingerprint module is in need of a serious 
> overhaul. This is shame as it’s one of the most used features. In fact I’d 
> already decided it was beyond repair and had started drafting out a complete 
> rewrite.That’s still in progress though.
> 
>> On 21 Oct 2014, at 02:29, Andrew Dalke <da...@dalkescientific.com> wrote:
>> 
>> Hi all,
>> 
>> I've been reviewing the history of hash-based fingerprinting systems, i.e., 
>> those derived from ideas first implemented by Daylight.
>> 
>> The primary documentation is at 
>> http://www.daylight.com/dayhtml/doc/theory/theory.finger.html - I've been 
>> unable to find any publication on it. The main detail I'm investigating is:
>> 
>>    Instead, each pattern serves as a seed to a pseudo-random number
>>    generator (it is "hashed"), the output of which is a set of bits
>>    (typically 4 or 5 bits per pattern); the set of bits thus produced
>>    is added (with a logical OR) to the fingerprint.
>> 
>> Daylight sets multiple bits to reduce the risk that two fragments will have 
>> identical signatures. Oddly, I've come across no research which describes 
>> how to choose if it's 4 or 5 or something else. RDKit, for example, uses 
>> nBitsPerHash=2. Indigo uses a number based on the size of the subgraph and 
>> its symmetry. Smaller and more symmetric subgraphs get fewer bits.
>> 
>> In talking to people who have used the Daylight toolkit, Daylight also used 
>> a variable number of bits, based on the size of the linear fragment.
>> 
>> I look at fingerprint/Fingerprinter.java to see what what CDK does, and I 
>> would like to confirm that I understand it, then make some observations. 
>> (The relevant code is also in HybridizationFingerprinter.java.)
>> 
>> The method getLimitedPathsOfLengthUpto() returns a list of {list of path 
>> atoms}.
>> 
>> For each {list of path atoms}:
>> get the string of the atom and bond symbols, as A0+B01+A1+B12+A3+
>> compute also the reverse, in order to determine a canonical ordering
>> use Java's string hash function to get a hash value
>> add that value to the return list
>> 
>> For each returned {hash value}:
>> seed a new java.util.Random(hash)
>> get the first integer from the seeded PRNG
>> set that bit in the fingerprint
>> 
>> If I'm correct, then this is somewhat different than Daylight's hash method 
>> because it only uses one bit per hash, rather than "typically 4 or 5 bits 
>> per pattern".
>> 
>> 1) What is the reason CDK decided to get only one random value and hence set 
>> only one bit?
> 
> I think nobody knew when writing it :-).
> 
>> 2) Given that there is only one bit, what does the PRNG add to the 
>> calculations? The PRNG ends up mapping one value randomly to another value, 
>> but it doesn't change the bit distribution, other than introducing slightly 
>> more bit collisions. What effect does it have on the usefulness of the 
>> fingerprint vs. just using the hash value directly?
> 
> Again no idea why that’s in there. I think this was recently fixed as it was 
> creating a new random instance for each and every bit! Actually it might 
> still be doing that as the code is copied in multiple places. I think the 
> point of PRNG was an attempt to reduce collisions but as you point out it 
> doesn’t make sense.
> 
>> 3) The implementation builds and discards an intermediate string 
>> representation for the fingerprint. I see that the first atom in the path 
>> has a special case for how to get the hash value:
>> 
>>               if (x instanceof IPseudoAtom)
>>                   sb.append('0');
>>               else {
>>                   Integer atnum = 
>> PeriodicTable.getAtomicNumber(x.getSymbol());
>>                   if (atnum > 0)
>>                       sb.append((char) atnum.intValue());
>>                   else
>>                       sb.append('0');
>>               }
>> 
>> but the remaining atoms have simply:
>> 
>>                   sb.append(convertSymbol(y[0].getSymbol()));
>> 
>> Is there something special about the first atom to require those special 
>> checks?
> 
> Good spot.
> 
>> 4) The string it builds looks something like "C=C-O", and the reverse is 
>> "O-C=C". The lexicographically first of these is used for the hash string. 
>> What happens if an atom has two characters, like "C-Al"? The reverse of this 
>> is "lA-C", so it chooses "C-Al". But if the fragment were "Al-C" then the 
>> reverse is "C-lA", and it would choose "C-lA".
>> 
>> It seems like there's a problem, since "C-Al" and "Al-C" should return the 
>> same hash. That's where convertSymbol() comes to the rescue. It normalizes 
>> "Al" to "A", as well as some of the other two letter elements. So this is 
>> converted to "C-A" vs. "A-C", and all is okay.
> 
> Yes a problem with “stingy” programming really the paths should be hashed 
> directly as their discovered.
> 
>> Except convertSymbol() is incomplete. It only normalizes Cl, Br, Si, As, Li, 
>> Se, Na, Ca, and Al. It doesn't support, say, Au or Pb. My reading of the 
>> code suggests that some rare molecules (including those with rare earths :) 
>> may have fingerprints which are determined by the input atom order, which 
>> shouldn't be the case for a fingerprint.
>> 
>> The best solution I can think of is to collect the bond symbols into their 
>> own list. Then use one scan through the atom symbols from each end, to 
>> determine the canonical best order. If there's a tie then do the same with 
>> the bond symbols list. One the direction has been determined, generate a 
>> hash from traversing the atom and bond symbols in the correct direction.
>> 
>> 5) There's a cleanup step that I don't understand, 
>> 
>>       for (StringBuffer s : allPaths) {
>>           String s1 = s.toString().trim();
>>           if (s1.equals("")) continue;
>>           if (cleanPath.contains(s1)) continue;
>>           String s2 = s.reverse().toString().trim();
>>           if (cleanPath.contains(s2)) continue;
>>           cleanPath.add(s2);
>>       }
>> 
>> By construction, when will trim() ever do anything? I think only when the 
>> terminal atoms have a symbol of " " or other whitespace, which shouldn't 
>> ever happen, and if it were, I would think that "\t" and " " should mean 
>> different things.
>> 
>> The only way I can see to get the "" is if all of the atoms are whitespace, 
>> and all of the bonds (if there are any) are quadruple.
>> 
>> (BTW, there may need to be a QUADRUPLE case for the getBondSymbol() 
>> implementation.)
> 
> Yeah… we technically also have bond order 5 and 6 but you can’t store that in 
> anything standard….
> 
>> What I don't understand though is the additional reverse(). The canonical 
>> string was already selected, so what does s2 add?
> 
> Neither do I.
> 
>> 6) Finally, I think the cleanup code is is better merged with the main 
>> fragment enumeration loop.
>> 
>> As it stands, if there are 100 identical fragments (e.g., "C.C.C. ... .C") 
>> then it builds up a list of 100 StringBuffer instances, then converts them 
>> to a Set<String> with one instance, then converts that to a int[] of hash 
>> values.
>> 
>> If the Set<String> is moved up, and the unique test done for each canonical 
>> fragment string as soon as it's created, then there would be no need for an 
>> intermediate StringBuffer list.
>> 
>> Or, since the downstream code will (re)set the already set bits if it sees 
>> the same hash code twice, then there's no need to distinguish between two 
>> strings with the same hash code, and this could be replaced with a Set<int> 
>> containing the hash codes.
> 
> Indeed as I said… needs a rewrite. Actually I should go round with the 
> deprecation tag on some of these classes. Thanks for looking in to this in 
> detail and we will hopefully have something better in future.
> 
> Best regards,
> John
> 
>> Best regards,
>> 
>>                              Andrew
>>                              da...@dalkescientific.com 
>> <mailto:da...@dalkescientific.com>
------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user

Reply via email to