Hi Yihan,

Could you open an issue on GitHub? There are some small changes we could
make to match PubChem more closely.

The PubChem CACTVS fingerprint implementation is private, so it's not
possible to match it exactly based on code. However, it should be
"relatively" close to what has been documented:
https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt

For some reason (we should change that) you need to make hydrogens explicit
for the fingerprint:

SmilesParser smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
IAtomContainer mol =
    smipar.parseSmiles("CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C");
AtomContainerManipulator.convertImplicitToExplicitHydrogens(mol);
BitSet fp = new PubchemFingerprinter(SilentChemObjectBuilder.getInstance())
    .getBitFingerprint(mol)
    .asBitSet();
System.out.println(fp);

That takes care of bits 0, 1, and 2:

{0, 1, 2, 9, 10, 11, 12, 14, 15, 18, 19, 20, 33, 129, 131, 132, 143, 145,
146, 178, 179, 255, 283, 284, 285, 286, 293, 299, 308, 332, 333, 338, 340,
344, 345, 349, 351, 352, 353, 355, 356, 365, 368, 370, 371, 374, 380, 384,
390, 391, 392, 393, 406, 412, 416, 420, 430, 434, 439, 440, 441, 443, 446,
451, 452, 464, 470, 489, 490, 507, 516, 520, 524, 528, 535, 536, 540, 549,
552, 556, 564, 566, 569, 570, 578, 579, 580, 582, 584, 586, 592, 595, 597,
599, 602, 603, 607, 608, 611, 613, 617, 618, 633, 634, 637, 640, 643, 645,
646, 656, 658, 659, 660, 664, 668, 677, 678, 679, 683, 684, 688, 692, 696,
704, 708, 709, 710}
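
To see which positions still differ from the string PubChem reports, a
small stdlib-only helper along these lines could be used. This is only a
sketch: the class and method names are hypothetical (not part of CDK), and
it assumes character i of the 0/1 string corresponds to bit i, as in the
strings quoted further down.

```java
import java.util.BitSet;

/** Hypothetical helper for comparing two fingerprints; not part of CDK. */
final class FingerprintDiff {

    /** Parses a string of '0'/'1' characters; character i becomes bit i. */
    static BitSet fromBitString(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c == '1') {
                set.set(i);
            } else if (c != '0') {
                throw new IllegalArgumentException("Not a bit at index " + i + ": " + c);
            }
        }
        return set;
    }

    /** Returns the positions at which the two fingerprints disagree (XOR). */
    static BitSet differingBits(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone(); // leave the inputs untouched
        diff.xor(b);
        return diff;
    }

    public static void main(String[] args) {
        // Short made-up strings; substitute the real 881-bit fingerprints.
        BitSet computed = fromBitString("00010110");
        BitSet reference = fromBitString("11010100");
        System.out.println(differingBits(computed, reference)); // prints {0, 1, 6}
    }
}
```

The BitSet returned by asBitSet() can be passed straight to differingBits
together with the parsed PubChem string.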

The remaining differing bits depend on which ring set we use:

213 >= 1 any ring size 7
215 >= 1 saturated or aromatic nitrogen-containing ring size 7
216 >= 1 saturated or aromatic heteroatom-containing ring size 7

IIRC PubChem/CACTVS substructure keys use a different cycle definition
(based on the shortest cycle through triples, i.e. atom-bond-atom-bond-atom)
rather than SSSR/MCB. We didn't have the option to find these when the
fingerprint was first written, but we do now. We can make this small change:

PubchemFingerprinter.java:
        public CountRings(IAtomContainer m) {
            // ringSet = Cycles.sssr(m).toRingSet(); // wrong
            ringSet = Cycles.tripletShort(m).toRingSet();
        }

and get the expected bits set:

{0, 1, 2, 9, 10, 11, 12, 14, 15, 18, 19, 20, 33, 129, 131, 132, 143, 145,
146, 178, 179, 213, 215, 216, 255, 283, 284, 285, 286, 293, 299, 308, 332,
333, 338, 340, 344, 345, 349, 351, 352, 353, 355, 356, 365, 368, 370, 371,
374, 380, 384, 390, 391, 392, 393, 406, 412, 416, 420, 430, 434, 439, 440,
441, 443, 446, 451, 452, 464, 470, 489, 490, 507, 516, 520, 524, 528, 535,
536, 540, 549, 552, 556, 564, 566, 569, 570, 578, 579, 580, 582, 584, 586,
592, 595, 597, 599, 602, 603, 607, 608, 611, 613, 617, 618, 633, 634, 637,
640, 643, 645, 646, 656, 658, 659, 660, 664, 668, 677, 678, 679, 683, 684,
688, 692, 696, 704, 708, 709, 710}
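
To illustrate why the ring set matters here, the following stdlib-only
sketch enumerates ring sizes on the fused 4/5 bicyclic core of ampicillin
using a simplified reading of the "shortest cycle through triples" idea:
for every path u-v-w, take the shortest route from u to w that avoids v.
It is not CDK's implementation, the class name and atom numbering are made
up for the example, and PubChem's actual ring perception may differ in
detail.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical illustration of triple-based ring sizes; not CDK code. */
final class TripletRingSizes {

    /** Builds an adjacency list for an undirected graph with n vertices. */
    static List<List<Integer>> graph(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        return adj;
    }

    /** BFS distance in bonds from src to dst avoiding 'blocked'; -1 if none. */
    static int distanceAvoiding(List<List<Integer>> adj, int src, int dst, int blocked) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);
        dist[src] = 0;
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        queue.add(src);
        while (!queue.isEmpty()) {
            int v = queue.poll();
            if (v == dst) return dist[v];
            for (int w : adj.get(v)) {
                if (w != blocked && dist[w] < 0) {
                    dist[w] = dist[v] + 1;
                    queue.add(w);
                }
            }
        }
        return -1;
    }

    /** Sizes (in atoms) of the shortest cycle through every path u-v-w. */
    static SortedSet<Integer> ringSizes(List<List<Integer>> adj) {
        SortedSet<Integer> sizes = new TreeSet<>();
        for (int v = 0; v < adj.size(); v++) {
            List<Integer> nbrs = adj.get(v);
            for (int i = 0; i < nbrs.size(); i++) {
                for (int j = i + 1; j < nbrs.size(); j++) {
                    int d = distanceAvoiding(adj, nbrs.get(i), nbrs.get(j), v);
                    // d bonds between u and w, plus the two bonds to v
                    if (d > 0) sizes.add(d + 2);
                }
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Ring atoms only: 0=S, 1=C(CH3)2, 2=CH(COOH), 3=N, 4=bridgehead C,
        // 5=CH(NH-), 6=C(=O). The two rings share the bond 3-4.
        int[][] bonds = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 0}, {4, 5}, {5, 6}, {6, 3}};
        System.out.println(ringSizes(graph(7, bonds))); // prints [4, 5, 7]
    }
}
```

An SSSR/MCB of the same core contains only the 4- and 5-membered rings; the
7-membered envelope ring is what sets bits 213, 215 and 216.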

I suggest we add an option to the fingerprint to use the correct ring set,
but we should also check for other discrepancies with PubChem (i.e. please
open a GitHub issue).

John

On Wed, 19 Apr 2023 at 09:09, Yihan Wu <yih...@alumni.princeton.edu> wrote:

> Hi,
>
> I've come across a discrepancy between the pubchem fingerprint obtained
> through CDK (calculated from SMILES) and the pubchem fingerprint extracted
> directly from the pubchem website. For example, the Canonical SMILES of
>  compound Ampicillin (pubchem CID 6249) is
> CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C.
> The calculation of pubchem fingerprint based on this SMILES by CDK is
>
> 00000000011110110011100000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010110000000000101100000000000000000000000000000001100000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000001111000000100000100000000100000000000000000000000110000101000110001011101100000000100101100100000100010000011110000000000001000001000100010000000001000100001110100100001100000000000100000100000000000000000011000000000000000010000000010001000100010000001100010000000010010001000000010100110000000111010101000001001010100110001100101000110000000000000011001001001011000000000101110001000100000000111000110001000100010000000100011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> The pubchem fingerprint extracted from pubchem website for this compound is
>
> 11100000011110110011100000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010110000000000101100000000000000000000000000000001100000000000000000000000000000000010110000000000000000000000000000000000000010000000000000000000000000001111000000100000100000000100000000000000000000000110000101000110001011101100000000100101100100000100010000011110000000000001000001000100010000000001000100001110100100001100000000000100000100000000000000000011000000000000000010000000010001000100010000001100010000000010010001000000010100110000000111010101000001001010100110001100101000110000000000000011001001001011000000000101110001000100000000111000110001000100010000000100011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> The two fingerprints differ on positions 0, 1, 2, 213, 215, and 216.
> Have any of you encountered a similar issue or could anyone identify what
> mistake I may have made? Any assistance provided would be greatly
> appreciated!
>
> Thank you,
> Yihan
>
> _______________________________________________
> Cdk-user mailing list
> Cdk-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/cdk-user
>