> Begin forwarded message:
>
> Subject: Re: [Cdk-user] questions about the hash fingerprinters
> From: John May <j...@nextmovesoftware.com>
> Date: 28 November 2014 09:55:56 GMT
> To: Andrew Dalke <da...@dalkescientific.com>
>
> Hi Andrew,
>
> Firstly it should be noted the fingerprint module is in need of a serious
> overhaul. This is shame as it’s one of the most used features. In fact I’d
> already decided it was beyond repair and had started drafting out a complete
> rewrite.That’s still in progress though.
>
>> On 21 Oct 2014, at 02:29, Andrew Dalke <da...@dalkescientific.com> wrote:
>>
>> Hi all,
>>
>> I've been reviewing the history of hash-based fingerprinting systems, i.e.,
>> those derived from ideas first implemented by Daylight.
>>
>> The primary documentation is at
>> http://www.daylight.com/dayhtml/doc/theory/theory.finger.html - I've been
>> unable to find any publication on it. The main detail I'm investigating is:
>>
>> Instead, each pattern serves as a seed to a pseudo-random number
>> generator (it is "hashed"), the output of which is a set of bits
>> (typically 4 or 5 bits per pattern); the set of bits thus produced
>> is added (with a logical OR) to the fingerprint.
>>
>> Daylight sets multiple bits to reduce the risk that two fragments will have
>> identical signatures. Oddly, I've come across no research which describes
>> how to choose if it's 4 or 5 or something else. RDKit, for example, uses
>> nBitsPerHash=2. Indigo uses a number based on the size of the subgraph and
>> its symmetry. Smaller and more symmetric subgraphs get fewer bits.
>>
>> In talking to people who have used the Daylight toolkit, Daylight also used
>> a variable number of bits, based on the size of the linear fragment.
>>
>> I look at fingerprint/Fingerprinter.java to see what what CDK does, and I
>> would like to confirm that I understand it, then make some observations.
>> (The relevant code is also in HybridizationFingerprinter.java.)
>>
>> The method getLimitedPathsOfLengthUpto() returns a list of {list of path
>> atoms}.
>>
>> For each {list of path atoms}:
>> get the string of the atom and bond symbols, as A0+B01+A1+B12+A3+
>> compute also the reverse, in order to determine a canonical ordering
>> use Java's string hash function to get a hash value
>> add that value to the return list
>>
>> For each returned {hash value}:
>> seed a new java.util.Random(hash)
>> get the first integer from the seeded PRNG
>> set that bit in the fingerprint
>>
>> If I'm correct, then this is somewhat different than Daylight's hash method
>> because it only uses one bit per hash, rather than "typically 4 or 5 bits
>> per pattern".
>>
>> 1) What is the reason CDK decided to get only one random value and hence set
>> only one bit?
>
> I think nobody knew when writing it :-).
>
>> 2) Given that there is only one bit, what does the PRNG add to the
>> calculations? The PRNG ends up mapping one value randomly to another value,
>> but it doesn't change the bit distribution, other than introducing slightly
>> more bit collisions. What effect does it have on the usefulness of the
>> fingerprint vs. just using the hash value directly?
>
> Again no idea why that’s in there. I think this was recently fixed as it was
> creating a new random instance for each and every bit! Actually it might
> still be doing that as the code is copied in multiple places. I think the
> point of PRNG was an attempt to reduce collisions but as you point out it
> doesn’t make sense.
>
>> 3) The implementation builds and discards an intermediate string
>> representation for the fingerprint. I see that the first atom in the path
>> has a special case for how to get the hash value:
>>
>> if (x instanceof IPseudoAtom)
>> sb.append('0');
>> else {
>> Integer atnum =
>> PeriodicTable.getAtomicNumber(x.getSymbol());
>> if (atnum > 0)
>> sb.append((char) atnum.intValue());
>> else
>> sb.append('0');
>> }
>>
>> but the remaining atoms have simply:
>>
>> sb.append(convertSymbol(y[0].getSymbol()));
>>
>> Is there something special about the first atom to require those special
>> checks?
>
> Good spot.
>
>> 4) The string it builds looks something like "C=C-O", and the reverse is
>> "O-C=C". The lexicographically first of these is used for the hash string.
>> What happens if an atom has two characters, like "C-Al"? The reverse of this
>> is "lA-C", so it chooses "C-Al". But if the fragment were "Al-C" then the
>> reverse is "C-lA", and it would choose "C-lA".
>>
>> It seems like there's a problem, since "C-Al" and "Al-C" should return the
>> same hash. That's where convertSymbol() comes to the rescue. It normalizes
>> "Al" to "A", as well as some of the other two letter elements. So this is
>> converted to "C-A" vs. "A-C", and all is okay.
>
> Yes a problem with “stingy” programming really the paths should be hashed
> directly as their discovered.
>
>> Except convertSymbol() is incomplete. It only normalizes Cl, Br, Si, As, Li,
>> Se, Na, Ca, and Al. It doesn't support, say, Au or Pb. My reading of the
>> code suggests that some rare molecules (including those with rare earths :)
>> may have fingerprints which are determined by the input atom order, which
>> shouldn't be the case for a fingerprint.
>>
>> The best solution I can think of is to collect the bond symbols into their
>> own list. Then use one scan through the atom symbols from each end, to
>> determine the canonical best order. If there's a tie then do the same with
>> the bond symbols list. One the direction has been determined, generate a
>> hash from traversing the atom and bond symbols in the correct direction.
>>
>> 5) There's a cleanup step that I don't understand,
>>
>> for (StringBuffer s : allPaths) {
>> String s1 = s.toString().trim();
>> if (s1.equals("")) continue;
>> if (cleanPath.contains(s1)) continue;
>> String s2 = s.reverse().toString().trim();
>> if (cleanPath.contains(s2)) continue;
>> cleanPath.add(s2);
>> }
>>
>> By construction, when will trim() ever do anything? I think only when the
>> terminal atoms have a symbol of " " or other whitespace, which shouldn't
>> ever happen, and if it were, I would think that "\t" and " " should mean
>> different things.
>>
>> The only way I can see to get the "" is if all of the atoms are whitespace,
>> and all of the bonds (if there are any) are quadruple.
>>
>> (BTW, there may need to be a QUADRUPLE case for the getBondSymbol()
>> implementation.)
>
> Yeah… we technically also have bond order 5 and 6 but you can’t store that in
> anything standard….
>
>> What I don't understand though is the additional reverse(). The canonical
>> string was already selected, so what does s2 add?
>
> Neither do I.
>
>> 6) Finally, I think the cleanup code is is better merged with the main
>> fragment enumeration loop.
>>
>> As it stands, if there are 100 identical fragments (e.g., "C.C.C. ... .C")
>> then it builds up a list of 100 StringBuffer instances, then converts them
>> to a Set<String> with one instance, then converts that to a int[] of hash
>> values.
>>
>> If the Set<String> is moved up, and the unique test done for each canonical
>> fragment string as soon as it's created, then there would be no need for an
>> intermediate StringBuffer list.
>>
>> Or, since the downstream code will (re)set the already set bits if it sees
>> the same hash code twice, then there's no need to distinguish between two
>> strings with the same hash code, and this could be replaced with a Set<int>
>> containing the hash codes.
>
> Indeed as I said… needs a rewrite. Actually I should go round with the
> deprecation tag on some of these classes. Thanks for looking in to this in
> detail and we will hopefully have something better in future.
>
> Best regards,
> John
>
>> Best regards,
>>
>> Andrew
>> da...@dalkescientific.com
>> <mailto:da...@dalkescientific.com>
------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=157005751&iu=/4140/ostg.clktrk
_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user