Hi John,

Thank you very much for the quick reply. I greatly appreciate your help in resolving my confusion! I have opened a GitHub issue as you suggested. Thank you.

Best,
Yihan

On Wed, Apr 19, 2023 at 4:45 PM John Mayfield <john.wilkinson...@gmail.com> wrote:

> Hi Yihan,
>
> Could you open an issue on GitHub? There are some small changes we could
> make to match more closely.
>
> The PubChem CACTVS fingerprint implementation is private and so it's not
> possible to match exactly based on code. However it should be "relatively"
> close to what has been documented:
>
> https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt
>
> For some reason (we should change that) you need to make hydrogens
> explicit for the fingerprint:
>
>> SmilesParser smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
>> IAtomContainer mol = smipar.parseSmiles("CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C");
>> AtomContainerManipulator.convertImplicitToExplicitHydrogens(mol);
>> BitSet fp = new PubchemFingerprinter(SilentChemObjectBuilder.getInstance())
>>         .getBitFingerprint(mol)
>>         .asBitSet();
>> System.out.println(fp);
>
> That takes care of (0,1,2):
>
> {0, 1, 2, 9, 10, 11, 12, 14, 15, 18, 19, 20, 33, 129, 131, 132, 143, 145,
> 146, 178, 179, 255, 283, 284, 285, 286, 293, 299, 308, 332, 333, 338, 340,
> 344, 345, 349, 351, 352, 353, 355, 356, 365, 368, 370, 371, 374, 380, 384,
> 390, 391, 392, 393, 406, 412, 416, 420, 430, 434, 439, 440, 441, 443, 446,
> 451, 452, 464, 470, 489, 490, 507, 516, 520, 524, 528, 535, 536, 540, 549,
> 552, 556, 564, 566, 569, 570, 578, 579, 580, 582, 584, 586, 592, 595, 597,
> 599, 602, 603, 607, 608, 611, 613, 617, 618, 633, 634, 637, 640, 643, 645,
> 646, 656, 658, 659, 660, 664, 668, 677, 678, 679, 683, 684, 688, 692, 696,
> 704, 708, 709, 710}
>
> The other different bits are to do with which ring set we use:
>
> 213 >= 1 any ring size 7
> 215 >= 1 saturated or aromatic nitrogen-containing ring size 7
> 216 >= 1 saturated or aromatic heteroatom-containing ring size 7
>
> IIRC PubChem/CACTVS substructure keys use a different cycle definition
> (based on shortest cycle through triples, i.e. atom-bond-atom-bond-atom)
> rather than SSSR/MCB. We didn't have the option to find these when the
> fingerprint was first written but we do now. We can make this small change:
>
> PubchemFingerprinter.java:
> public CountRings(IAtomContainer m) {
>     // ringSet = Cycles.sssr(m).toRingSet(); // wrong
>     ringSet = Cycles.tripletShort(m).toRingSet();
> }
>
> and get the expected bits set:
>
> {0, 1, 2, 9, 10, 11, 12, 14, 15, 18, 19, 20, 33, 129, 131, 132, 143, 145,
> 146, 178, 179, 213, 215, 216, 255, 283, 284, 285, 286, 293, 299, 308, 332,
> 333, 338, 340, 344, 345, 349, 351, 352, 353, 355, 356, 365, 368, 370, 371,
> 374, 380, 384, 390, 391, 392, 393, 406, 412, 416, 420, 430, 434, 439, 440,
> 441, 443, 446, 451, 452, 464, 470, 489, 490, 507, 516, 520, 524, 528, 535,
> 536, 540, 549, 552, 556, 564, 566, 569, 570, 578, 579, 580, 582, 584, 586,
> 592, 595, 597, 599, 602, 603, 607, 608, 611, 613, 617, 618, 633, 634, 637,
> 640, 643, 645, 646, 656, 658, 659, 660, 664, 668, 677, 678, 679, 683, 684,
> 688, 692, 696, 704, 708, 709, 710}
>
> I suggest we add an option to the fingerprint to use the correct ring set,
> but we should also check for other discrepancies in PubChem (i.e. please
> open a GitHub issue).
>
> John
>
> On Wed, 19 Apr 2023 at 09:09, Yihan Wu <yih...@alumni.princeton.edu>
> wrote:
>
>> Hi,
>>
>> I've come across a discrepancy between the pubchem fingerprint obtained
>> through CDK (calculated from SMILES) and the pubchem fingerprint extracted
>> directly from the pubchem website. For example, the Canonical SMILES of
>> compound Ampicillin (pubchem CID 6249) is
>> CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C.
>> The calculation of pubchem fingerprint based on this SMILES by CDK is >> >> 00000000011110110011100000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010110000000000101100000000000000000000000000000001100000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000001111000000100000100000000100000000000000000000000110000101000110001011101100000000100101100100000100010000011110000000000001000001000100010000000001000100001110100100001100000000000100000100000000000000000011000000000000000010000000010001000100010000001100010000000010010001000000010100110000000111010101000001001010100110001100101000110000000000000011001001001011000000000101110001000100000000111000110001000100010000000100011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 >> The pubchem fingerprint extracted from pubchem website for this compound >> is >> >> 11100000011110110011100000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010110000000000101100000000000000000000000000000001100000000000000000000000000000000010110000000000000000000000000000000000000010000000000000000000000000001111000000100000100000000100000000000000000000000110000101000110001011101100000000100101100100000100010000011110000000000001000001000100010000000001000100001110100100001100000000000100000100000000000000000011000000000000000010000000010001000100010000001100010000000010010001000000010100110000000111010101000001001010100110001100101000110000000000000011001001001011000000000101110001000100000000111000110001000100010000000100011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 >> The two fingerprints differ on positions 0, 1, 2, 213, 215, and 
>> 216. Have any of you encountered a similar issue or could anyone
>> identify what mistake I may have made? Any assistance provided would be
>> greatly appreciated!
>>
>> Thank you,
>> Yihan
>>
>> _______________________________________________
>> Cdk-user mailing list
>> Cdk-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/cdk-user
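
For anyone comparing two fingerprint strings like the ones quoted above, here is a small sketch of one way to list the positions where they differ, using only java.util.BitSet. The class and method names (FingerprintDiff, fromBitString, diff) are made up for illustration and are not part of CDK. It assumes character i of the string corresponds to bit i, which is how the positions were counted in this thread. The strings in main are short made-up examples, not real PubChem fingerprints.

```java
import java.util.BitSet;

// Hypothetical helper, not part of CDK: compares two '0'/'1' fingerprint strings.
class FingerprintDiff {

    /** Parses a string of '0'/'1' characters; character i becomes bit i. */
    static BitSet fromBitString(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c == '1') {
                set.set(i);
            } else if (c != '0') {
                throw new IllegalArgumentException("Unexpected character at " + i + ": " + c);
            }
        }
        return set;
    }

    /** Returns the positions that are set in exactly one of the two fingerprints. */
    static BitSet diff(BitSet a, BitSet b) {
        BitSet d = (BitSet) a.clone(); // clone so the inputs are left unchanged
        d.xor(b);
        return d;
    }

    public static void main(String[] args) {
        // Short made-up strings standing in for the full-length fingerprints.
        BitSet fromCdk = fromBitString("0001101");
        BitSet fromPubchem = fromBitString("1111101");
        System.out.println(diff(fromCdk, fromPubchem)); // prints {0, 1, 2}
    }
}
```

Pasting the two full-length strings in place of the made-up ones should give the list of differing positions directly, without counting characters by eye.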