Hi Yihan,

Could you open an issue on GitHub? There are some small changes we could make so that the CDK output matches PubChem more closely.

The PubChem CACTVS fingerprint implementation is private, so it's not possible to match it exactly based on code. However, it should be "relatively" close to what has been documented:

https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt

For some reason (we should change that) you need to make hydrogens explicit for the fingerprint:

    SmilesParser smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
    IAtomContainer mol = smipar.parseSmiles("CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C");
    // hydrogens must be explicit atoms for the hydrogen count bits to be set
    AtomContainerManipulator.convertImplicitToExplicitHydrogens(mol);
    BitSet fp = new PubchemFingerprinter(SilentChemObjectBuilder.getInstance())
            .getBitFingerprint(mol)
            .asBitSet();
    System.out.println(fp);

That takes care of bits 0, 1 and 2 (the hydrogen count keys in the specification):

    {0, 1, 2, 9, 10, 11, 12, 14, 15, 18, 19, 20, 33, 129, 131, 132, 143, 145,
    146, 178, 179, 255, 283, 284, 285, 286, 293, 299, 308, 332, 333, 338, 340,
    344, 345, 349, 351, 352, 353, 355, 356, 365, 368, 370, 371, 374, 380, 384,
    390, 391, 392, 393, 406, 412, 416, 420, 430, 434, 439, 440, 441, 443, 446,
    451, 452, 464, 470, 489, 490, 507, 516, 520, 524, 528, 535, 536, 540, 549,
    552, 556, 564, 566, 569, 570, 578, 579, 580, 582, 584, 586, 592, 595, 597,
    599, 602, 603, 607, 608, 611, 613, 617, 618, 633, 634, 637, 640, 643, 645,
    646, 656, 658, 659, 660, 664, 668, 677, 678, 679, 683, 684, 688, 692, 696,
    704, 708, 709, 710}

The other differing bits are to do with which ring set we use:

    213  >= 1 any ring size 7
    215  >= 1 saturated or aromatic nitrogen-containing ring size 7
    216  >= 1 saturated or aromatic heteroatom-containing ring size 7

IIRC PubChem/CACTVS substructure keys use a different cycle definition (based on the shortest cycle through triples, i.e. atom-bond-atom-bond-atom) rather than SSSR/MCB. We didn't have the option to find these when the fingerprint was first written, but we do now.

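To see which bits still disagree with PubChem, it can help to diff the two fingerprints mechanically rather than by eye. Below is a minimal sketch, assuming both fingerprints are available as strings of '0' and '1' characters in which the first character is bit 0. The class and method names (FingerprintDiff, fromBitString, differingBits) are made up for illustration and are not part of CDK, and the input strings are toy values, not real fingerprints.

```java
import java.util.BitSet;

public class FingerprintDiff {

    // Parse a string of '0'/'1' characters into a BitSet (first character = bit 0).
    public static BitSet fromBitString(String bits) {
        BitSet bs = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c == '1') {
                bs.set(i);
            } else if (c != '0') {
                throw new IllegalArgumentException(
                        "unexpected character '" + c + "' at position " + i);
            }
        }
        return bs;
    }

    // Positions that are set in exactly one of the two fingerprints.
    public static BitSet differingBits(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone(); // copy so the inputs are left untouched
        diff.xor(b);
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical 8-bit toy strings, not real PubChem fingerprints.
        BitSet cdk = fromBitString("00010110");
        BitSet pubchem = fromBitString("11110100");
        System.out.println(differingBits(cdk, pubchem)); // prints {0, 1, 2, 6}
    }
}
```

Applied to the two bit strings from the original message, the same helper should reproduce the list of positions reported there, and it can be rerun after each change to confirm that the list shrinks.
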
We can make this small change:

PubchemFingerprinter.java:

    public CountRings(IAtomContainer m) {
        // ringSet = Cycles.sssr(m).toRingSet(); // wrong
        ringSet = Cycles.tripletShort(m).toRingSet();
    }

and get the expected bits set:

    {0, 1, 2, 9, 10, 11, 12, 14, 15, 18, 19, 20, 33, 129, 131, 132, 143, 145,
    146, 178, 179, 213, 215, 216, 255, 283, 284, 285, 286, 293, 299, 308, 332,
    333, 338, 340, 344, 345, 349, 351, 352, 353, 355, 356, 365, 368, 370, 371,
    374, 380, 384, 390, 391, 392, 393, 406, 412, 416, 420, 430, 434, 439, 440,
    441, 443, 446, 451, 452, 464, 470, 489, 490, 507, 516, 520, 524, 528, 535,
    536, 540, 549, 552, 556, 564, 566, 569, 570, 578, 579, 580, 582, 584, 586,
    592, 595, 597, 599, 602, 603, 607, 608, 611, 613, 617, 618, 633, 634, 637,
    640, 643, 645, 646, 656, 658, 659, 660, 664, 668, 677, 678, 679, 683, 684,
    688, 692, 696, 704, 708, 709, 710}

I suggest we add an option to the fingerprint to use the correct ring set, but we should also check for other discrepancies with PubChem (i.e. please open a GitHub issue).

John

On Wed, 19 Apr 2023 at 09:09, Yihan Wu <yih...@alumni.princeton.edu> wrote:

> Hi,
>
> I've come across a discrepancy between the pubchem fingerprint obtained
> through CDK (calculated from SMILES) and the pubchem fingerprint extracted
> directly from the pubchem website. For example, the Canonical SMILES of
> compound Ampicillin (pubchem CID 6249) is
> CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C.
> The calculation of pubchem fingerprint based on this SMILES by CDK is > > 00000000011110110011100000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010110000000000101100000000000000000000000000000001100000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000001111000000100000100000000100000000000000000000000110000101000110001011101100000000100101100100000100010000011110000000000001000001000100010000000001000100001110100100001100000000000100000100000000000000000011000000000000000010000000010001000100010000001100010000000010010001000000010100110000000111010101000001001010100110001100101000110000000000000011001001001011000000000101110001000100000000111000110001000100010000000100011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > The pubchem fingerprint extracted from pubchem website for this compound is > > 11100000011110110011100000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010110000000000101100000000000000000000000000000001100000000000000000000000000000000010110000000000000000000000000000000000000010000000000000000000000000001111000000100000100000000100000000000000000000000110000101000110001011101100000000100101100100000100010000011110000000000001000001000100010000000001000100001110100100001100000000000100000100000000000000000011000000000000000010000000010001000100010000001100010000000010010001000000010100110000000111010101000001001010100110001100101000110000000000000011001001001011000000000101110001000100000000111000110001000100010000000100011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > The two fingerprints differ on positions 0, 1, 2, 213, 215, and 216. 
> Have any of you encountered a similar issue or could anyone identify what
> mistake I may have made? Any assistance provided would be greatly
> appreciated!
>
> Thank you,
> Yihan
>
> _______________________________________________
> Cdk-user mailing list
> Cdk-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/cdk-user