Re: [Rdkit-discuss] ErG: 2D Pharmacophore Similarity Searches

Francois BERENGER Sun, 03 Sep 2017 16:51:40 -0700

On 09/02/2017 11:05 PM, Konrad Koehler wrote:
> Hi everyone,
> 
> As a followup to my previous post, on reading the Stiefl paper (J. Chem.
> Inf. Model. 2006, 46, 208-220) more closely, I see my question about
> converting the ErG numpy array into a bit vector was a little naive.  It
> turns out that the ErG array contains floating point numbers, not integers.
> One could bin these numbers and then convert them into a count vector, but
> this seems like a lot of work.


There are better things than histograms out there; this one, for example:

@article{Wand1994,
        title = {Fast {Computation} of {Multivariate} {Kernel} {Estimators}},
        volume = {3},
        issn = {1061-8600},
        url = {http://amstat.tandfonline.com/doi/abs/10.1080/10618600.1994.10474656},
        doi = {10.1080/10618600.1994.10474656},
        number = {4},
        journal = {Journal of Computational and Graphical Statistics},
        author = {Wand, M. P.},
        month = dec,
        year = {1994},
        pages = {433--445}
}

"Linear binning" they call it.

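A minimal one-dimensional sketch of the idea, for illustration only. Instead
of dropping each value into a single histogram bin, linear binning shares it
between the two nearest grid points, weighted by proximity. The function name,
its arguments and the clamping of out-of-range values are choices made for
this sketch; they are not taken from the paper or from RDKit.

```python
def linear_binning(values, lo, hi, n_grid):
    """Spread each value over its two neighbouring grid points.

    Grid points are evenly spaced from lo to hi (inclusive). A value x lying
    between grid points g[j] and g[j+1] adds (g[j+1] - x) / delta to bin j
    and (x - g[j]) / delta to bin j+1, so every value contributes a total
    weight of 1.
    """
    if n_grid < 2 or hi <= lo:
        raise ValueError("need n_grid >= 2 and hi > lo")
    delta = (hi - lo) / float(n_grid - 1)
    counts = [0.0] * n_grid
    for x in values:
        # Assumption of this sketch: out-of-range values are clamped to the
        # ends of the grid rather than discarded.
        pos = (min(max(x, lo), hi) - lo) / delta
        j = min(int(pos), n_grid - 2)   # index of the left grid point
        frac = pos - j                  # 0.0 at the left point, 1.0 at the right
        counts[j] += 1.0 - frac
        counts[j + 1] += frac
    return counts
```

Unlike a plain histogram, the result changes smoothly as a value moves across
a bin boundary, which is the property that makes it attractive for float
fingerprints such as ErG.
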
> If anyone is interested, I was able to create and store ErG fingerprints
> for a large database (one million compounds), and then do an ErG fingerprint
> similarity search in several minutes without using PostgreSQL (see 
> example scripts below).  Binary fingerprint searches are much faster, 
> but the method below is fast enough for my purposes.
> 
> Cheers,
> 
> Konrad
> 
> Using the following post as a guide:
> 
> https://iwatobipen.wordpress.com/2016/01/16/ergfingerprint-in-rdkit/
> 
> The first script below will create ErG fingerprints for a SMILES file
> and the second script will sort the SMILES file by the ErG
> Tanimoto coefficient to a query molecule.
> 
> ===== CreateErGFingerprints.py =====
> 
> import gzip, cPickle
> from rdkit import Chem
> from rdkit.Chem import AllChem, rdReducedGraphs
> 
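> # Read one molecule per line: SMILES in column 0, name in column 1.
> mols = Chem.SmilesMolSupplier("MolSupplier.smi", titleLine=False,
>                               smilesColumn=0, nameColumn=1, delimiter=" \t")
> # Note: the supplier yields None for SMILES it cannot parse, which would
> # make GetErGFingerprint fail; the input is assumed to be clean.
> ergfps = [rdReducedGraphs.GetErGFingerprint(mol) for mol in mols]
> # Pickle the list of numpy arrays into a gzip-compressed file.
> fp = gzip.open('MolSupplier_ergfps.pkl.gz', 'wb')
> cPickle.dump(ergfps, fp)
> fp.close()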
> 
> ===== ErGFingerprintSimilaritySearch.py =====
> 
> import gzip, cPickle
> from rdkit import Chem
> from rdkit.Chem import AllChem, rdReducedGraphs
> import numpy as np
> import os, re, sys
> import fileinput
> 
> # The ErG fingerprint is a vector of floats, not a bit vector, so use the
> # continuous form of the Tanimoto coefficient.
> def calc_ergtc( fp1, fp2 ):
>      denominator = ( np.sum( np.dot(fp1,fp1) ) + np.sum( np.dot(fp2,fp2) )
>                      - np.sum( np.dot(fp1,fp2) ) )
>      numerator = np.sum( np.dot(fp1,fp2) )
>      return numerator / denominator
> 
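> # Read the query (SMILES<tab>name) from stdin or from files named on the
> # command line; if several lines are given, the last one is used.
> for line in fileinput.input():
>      (query_smiles, query_name) = line.strip().split("\t")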
> 
> query_mol   = Chem.MolFromSmiles(query_smiles)
> query_ergfp = rdReducedGraphs.GetErGFingerprint(query_mol)
> 
> fp_file = "MolSupplier_ergfps.pkl.gz"
> 
> mols = Chem.SmilesMolSupplier("MolSupplier.smi", titleLine=False,
>                               smilesColumn=0, nameColumn=1, delimiter=" \t")
> 
> fp=gzip.open(fp_file,'rb')
> ergfps=cPickle.load(fp)
> fp.close()
> 
> results = []
> f = open("MolSupplier.smi",'r')
> for index, line in enumerate(f.readlines()):
>      (smiles, name) = line.strip().split("\t")
>      ergtc = calc_ergtc(query_ergfp, ergfps[index])
>      results.append([smiles, name, ergtc])
> sorted_results = sorted(results, key = lambda x: x[2], reverse=True)
> 
> print '%s\t%s\t%.4f' % (query_smiles, query_name, 1.0)
> for result in sorted_results:
>      (smiles, name, ergtc) = result
>      print '%s\t%s\t%.4f' % (smiles, name, ergtc)
> 
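
The score used above is the continuous Tanimoto coefficient,
T(a,b) = a.b / (a.a + b.b - a.b). A minimal sketch of the same calculation
done for all stored fingerprints in one numpy call rather than one Python
call per molecule. It assumes the pickled list can be stacked into a single
2-D array (every fingerprint the same length); the function name and the
choice to return 0.0 when both vectors are all zeros are made up for this
illustration.

```python
import numpy as np

def erg_tanimoto_many(query, fps):
    """Continuous Tanimoto of one query vector against each row of fps."""
    query = np.asarray(query, dtype=float)
    fps = np.asarray(fps, dtype=float)        # shape (n_molecules, n_features)
    cross = np.dot(fps, query)                # a.b for every row
    self_query = np.dot(query, query)         # a.a, a single number
    self_fps = np.einsum("ij,ij->i", fps, fps)  # b.b for every row
    denom = self_query + self_fps - cross
    # denom is zero only when both vectors are all zeros; leave those
    # entries at 0.0 instead of producing NaN (a choice of this sketch).
    sims = np.zeros_like(denom)
    np.divide(cross, denom, out=sims, where=denom > 0)
    return sims
```

With sims = erg_tanimoto_many(query_ergfp, ergfps), np.argsort(-sims) gives
the database rows ordered from most to least similar.
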
>> On 1 Sep 2017, at 09:36, Konrad Koehler <konrad.koeh...@icloud.com 
>> <mailto:konrad.koeh...@icloud.com>> wrote:
>>
>> Hi,
>>
>> I am trying to add ErG fingerprints to PostgreSQL using the following 
>> post as a guide:
>>
>> https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/Custom%20fingerprint%20in%20PostgreSQL.ipynb
>>
>> My installation is as follows:
>>
>> Debian 3.16.43-2
>> rdkit                     2016.03.4           np111py27_1    rdkit
>> rdkit-postgresql          2016.03.4                py27_1    rdkit
>>
>> In the example linked above, I had to replace the first line below with
>> the second (presumably because I am running Python 2.7 instead of 3.x):
>> m = Chem.Mol(pkl.tobytes())
>> m = Chem.Mol(str(pkl))
>>
>> I then run into the following two problems:
>>
>> *First problem*: Null characters.  When I run the example script 
>> (using the Sheridan bit vector fingerprints), I generate the following 
>> error message:
>>
>> curs.executemany('insert into fps values (%s,bfp_from_binary_text(%s))',
>>                  [(x,DataStructs.BitVectToBinaryText(y)) for x,y in fps])
>> ValueError: A string literal cannot contain NUL (0x00) characters.
>>
>> I am not sure what I should do here. I could strip the null characters
>> from the binary text, but are the null characters supposed to be
>> there? Should I use the bytea data type on the PostgreSQL side?
>>
>> *Second problem*: converting a numpy array into a bit vector
>>
>> The linked example creates a fingerprint as a bit vector:
>> fp = Sheridan.GetBTFingerprint(m,fpfn=rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect)
>>
>> whereas the rdReducedGraphs.GetErGFingerprint method produces a numpy array:
>> fp = rdReducedGraphs.GetErGFingerprint(m)
>>
>> Is there any way of converting this numpy array into a bit vector?
>>
>>
>> Any suggestions would be greatly appreciated.  Thanks,
>>
>> Konrad
>>
> 
> 
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> 
> 
> 
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> 
