Hi John,

Thank you very much for the quick reply. I greatly appreciate your help in
resolving my confusion!
A GitHub issue has been opened as you suggested. Thank you.

Best,
Yihan

On Wed, Apr 19, 2023 at 4:45 PM John Mayfield <john.wilkinson...@gmail.com>
wrote:

> Hi Yihan,
>
> Could you open an issue on GitHub? There are some small changes we could
> make to match more closely.
>
> The PubChem CACTVS fingerprint implementation is private and so it's not
> possible to match exactly based on code. However it should be "relatively"
> close to what has been documented:
>
> https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt
>
> For some reason (we should change that) you need to make hydrogens
> explicit for the fingerprint:
>
>     SmilesParser smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
>     IAtomContainer mol =
>         smipar.parseSmiles("CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C");
>     AtomContainerManipulator.convertImplicitToExplicitHydrogens(mol);
>     BitSet fp = new PubchemFingerprinter(SilentChemObjectBuilder.getInstance())
>         .getBitFingerprint(mol)
>         .asBitSet();
>     System.out.println(fp);
>
> That takes care of (0,1,2):
>
> {0, 1, 2, 9, 10, 11, 12, 14, 15, 18, 19, 20, 33, 129, 131, 132, 143, 145,
> 146, 178, 179, 255, 283, 284, 285, 286, 293, 299, 308, 332, 333, 338, 340,
> 344, 345, 349, 351, 352, 353, 355, 356, 365, 368, 370, 371, 374, 380, 384,
> 390, 391, 392, 393, 406, 412, 416, 420, 430, 434, 439, 440, 441, 443, 446,
> 451, 452, 464, 470, 489, 490, 507, 516, 520, 524, 528, 535, 536, 540, 549,
> 552, 556, 564, 566, 569, 570, 578, 579, 580, 582, 584, 586, 592, 595, 597,
> 599, 602, 603, 607, 608, 611, 613, 617, 618, 633, 634, 637, 640, 643, 645,
> 646, 656, 658, 659, 660, 664, 668, 677, 678, 679, 683, 684, 688, 692, 696,
> 704, 708, 709, 710}
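
To compare a BitSet like the one above with the 0/1 string shown on the PubChem page, it can be rendered as a fixed-width string with bit 0 printed first, which is the ordering the strings quoted later in this thread appear to use. A minimal sketch using only java.util.BitSet; the class and method names are illustrative and not part of CDK, and the caller supplies the width (881 for the PubChem fingerprint):

```java
import java.util.BitSet;

final class FingerprintBits {

    /** Renders bits 0..length-1 as '0'/'1' characters, bit 0 first. */
    static String toBitString(BitSet bits, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(bits.get(i) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Small stand-in fingerprint; a real one would come from
        // PubchemFingerprinter and be rendered with length 881.
        BitSet fp = new BitSet();
        fp.set(0);
        fp.set(2);
        fp.set(9);
        System.out.println(toBitString(fp, 12)); // prints 101000000100
    }
}
```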
>
> The other different bits are to do with which ringset we use:
>
> 213 >= 1 any ring size 7
> 215 >= 1 saturated or aromatic nitrogen-containing ring size 7
> 216 >= 1 saturated or aromatic heteroatom-containing ring size 7
>
> IIRC PubChem/CACTVS substructure keys use a different cycle definition
> (based on shortest cycle through triples i.e. atom-bond-atom-bond-atom)
> rather than SSSR/MCB. We didn't have the option to find these when the
> fingerprint was first written but we do now. We can make this small change:
>
> PubchemFingerprinter.java:
>         public CountRings(IAtomContainer m) {
>             // ringSet = Cycles.sssr(m).toRingSet(); // wrong
>             ringSet = Cycles.tripletShort(m).toRingSet();
>         }
>
> and get the expected bits set:
>
> {0, 1, 2, 9, 10, 11, 12, 14, 15, 18, 19, 20, 33, 129, 131, 132, 143, 145,
> 146, 178, 179, 213, 215, 216, 255, 283, 284, 285, 286, 293, 299, 308, 332,
> 333, 338, 340, 344, 345, 349, 351, 352, 353, 355, 356, 365, 368, 370, 371,
> 374, 380, 384, 390, 391, 392, 393, 406, 412, 416, 420, 430, 434, 439, 440,
> 441, 443, 446, 451, 452, 464, 470, 489, 490, 507, 516, 520, 524, 528, 535,
> 536, 540, 549, 552, 556, 564, 566, 569, 570, 578, 579, 580, 582, 584, 586,
> 592, 595, 597, 599, 602, 603, 607, 608, 611, 613, 617, 618, 633, 634, 637,
> 640, 643, 645, 646, 656, 658, 659, 660, 664, 668, 677, 678, 679, 683, 684,
> 688, 692, 696, 704, 708, 709, 710}
>
> I suggest we add an option to the fingerprint to use the correct ring set,
> but we should also check for other discrepancies with PubChem (i.e. please
> open a GitHub issue).
>
> John
>
> On Wed, 19 Apr 2023 at 09:09, Yihan Wu <yih...@alumni.princeton.edu>
> wrote:
>
>> Hi,
>>
>> I've come across a discrepancy between the PubChem fingerprint calculated
>> by CDK from SMILES and the PubChem fingerprint extracted directly from the
>> PubChem website. For example, the canonical SMILES of ampicillin
>> (PubChem CID 6249) is
>> CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C.
>> The PubChem fingerprint that CDK calculates from this SMILES is
>>
>> 00000000011110110011100000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010110000000000101100000000000000000000000000000001100000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000001111000000100000100000000100000000000000000000000110000101000110001011101100000000100101100100000100010000011110000000000001000001000100010000000001000100001110100100001100000000000100000100000000000000000011000000000000000010000000010001000100010000001100010000000010010001000000010100110000000111010101000001001010100110001100101000110000000000000011001001001011000000000101110001000100000000111000110001000100010000000100011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
>> The PubChem fingerprint extracted from the PubChem website for this
>> compound is
>>
>> 11100000011110110011100000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010110000000000101100000000000000000000000000000001100000000000000000000000000000000010110000000000000000000000000000000000000010000000000000000000000000001111000000100000100000000100000000000000000000000110000101000110001011101100000000100101100100000100010000011110000000000001000001000100010000000001000100001110100100001100000000000100000100000000000000000011000000000000000010000000010001000100010000001100010000000010010001000000010100110000000111010101000001001010100110001100101000110000000000000011001001001011000000000101110001000100000000111000110001000100010000000100011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
>> The two fingerprints differ at positions 0, 1, 2, 213, 215, and 216.
>> Have any of you encountered a similar issue or could anyone identify what
>> mistake I may have made? Any assistance provided would be greatly
>> appreciated!
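
The differing positions listed above can be found mechanically by XOR-ing the two bit strings. A minimal sketch using only java.util.BitSet, assuming both strings have the same length and list bit 0 first; the helper names are illustrative and not part of CDK, and the short strings in main are stand-ins for the full fingerprints:

```java
import java.util.BitSet;

final class FingerprintDiff {

    /** Parses a string of '0'/'1' characters into a BitSet, bit 0 first. */
    static BitSet parse(String bitString) {
        BitSet bits = new BitSet(bitString.length());
        for (int i = 0; i < bitString.length(); i++) {
            if (bitString.charAt(i) == '1') {
                bits.set(i);
            }
        }
        return bits;
    }

    /** Returns the positions at which the two bit strings differ. */
    static BitSet differingBits(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("bit strings differ in length");
        }
        BitSet diff = parse(a);
        diff.xor(parse(b)); // a bit stays set only where a and b disagree
        return diff;
    }

    public static void main(String[] args) {
        // Short stand-in strings; paste the two full fingerprints here instead.
        System.out.println(differingBits("000110", "111100")); // prints {0, 1, 2, 4}
    }
}
```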
>>
>> Thank you,
>> Yihan
>>
>> _______________________________________________
>> Cdk-user mailing list
>> Cdk-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/cdk-user
>>
>