Hi everyone,
As a follow-up to my previous post: on reading the Stiefl paper (J. Chem. Inf.
Model. 2006, 46, 208-220) more closely, I see that my question about
converting the ErG numpy array into a bit vector was a little naive. It turns
out that the ErG array contains floating point numbers, not integers. One
could bin these numbers and then convert them into a count vector, but this
seems like a lot of work.
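For what it is worth, the binning idea could look roughly like the sketch
below. It is only an illustration: the helper name erg_to_counts and the bin
width of 0.1 are assumptions of mine, not anything taken from the paper or
from RDKit, and the example array is made up rather than a real ErG
fingerprint.

```python
import numpy as np

def erg_to_counts(erg_fp, bin_width=0.1):
    """Turn a float vector into integer counts by dividing by bin_width
    and rounding to the nearest integer.

    Both the function name and the default bin_width are hypothetical,
    chosen for illustration only.
    """
    arr = np.asarray(erg_fp, dtype=float)
    return np.rint(arr / bin_width).astype(int)

# Made-up values standing in for a few entries of an ErG vector.
example = np.array([0.0, 0.3, 1.0, 1.3])
counts = erg_to_counts(example)
```

Whether a count vector built this way would be usable on the PostgreSQL side
is a separate question that I have not tested.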
If anyone is interested, I was able to create and store ErG fingerprints for a
large database (one million compounds), and then run an ErG fingerprint
similarity search in several minutes without using PostgreSQL (see the example
scripts below). Binary fingerprint searches are much faster, but the method
below is fast enough for my purposes.
Cheers,
Konrad
Using the following post as a guide:
https://iwatobipen.wordpress.com/2016/01/16/ergfingerprint-in-rdkit/
The first script below creates ErG fingerprints for a SMILES file, and the
second script sorts the SMILES file by the ErG Tanimoto coefficient to a
query molecule.
===== CreateErGFingerprints.py =====
import gzip, cPickle
from rdkit import Chem
from rdkit.Chem import rdReducedGraphs

# Read the SMILES file: SMILES in column 0, name in column 1, no title line.
mols = Chem.SmilesMolSupplier("MolSupplier.smi", titleLine=False,
                              smilesColumn=0, nameColumn=1, delimiter=" \t")

# One ErG fingerprint (a numpy array of floats) per molecule.
ergfps = [rdReducedGraphs.GetErGFingerprint(mol) for mol in mols]

# Store the whole list as a gzipped pickle.
fp = gzip.open('MolSupplier_ergfps.pkl.gz', 'wb')
cPickle.dump(ergfps, fp)
fp.close()
===== ErGFingerprintSimilaritySearch.py =====
import gzip, cPickle
import fileinput
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdReducedGraphs

# ErG FP is not a bit vector, so use the Tanimoto coefficient for
# continuous vectors: a.b / (a.a + b.b - a.b)
def calc_ergtc(fp1, fp2):
    numerator = np.dot(fp1, fp2)
    denominator = np.dot(fp1, fp1) + np.dot(fp2, fp2) - numerator
    return numerator / denominator

# Read the query (SMILES<tab>name) from a file argument or from stdin.
# If several lines are supplied, the last one is used as the query.
for line in fileinput.input():
    (query_smiles, query_name) = line.strip().split("\t")
    query_mol = Chem.MolFromSmiles(query_smiles)
    query_ergfp = rdReducedGraphs.GetErGFingerprint(query_mol)

# Load the fingerprints written by CreateErGFingerprints.py.
fp_file = "MolSupplier_ergfps.pkl.gz"
fp = gzip.open(fp_file, 'rb')
ergfps = cPickle.load(fp)
fp.close()

# Fingerprint i is assumed to correspond to line i of the SMILES file.
results = []
f = open("MolSupplier.smi", 'r')
for index, line in enumerate(f):
    (smiles, name) = line.strip().split("\t")
    ergtc = calc_ergtc(query_ergfp, ergfps[index])
    results.append([smiles, name, ergtc])
f.close()

# Print the query first, then the database sorted by decreasing similarity.
sorted_results = sorted(results, key=lambda x: x[2], reverse=True)
print '%s\t%s\t%.4f' % (query_smiles, query_name, 1.0)
for result in sorted_results:
    (smiles, name, ergtc) = result
    print '%s\t%s\t%.4f' % (smiles, name, ergtc)
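
If the per-molecule loop ever becomes the bottleneck, the same coefficient as
calc_ergtc can be computed for all database fingerprints at once with numpy.
The following is a minimal sketch, not something I have benchmarked on the
full database: the function name calc_ergtc_bulk is hypothetical, the small
arrays are made-up stand-ins for real ErG fingerprints, and it assumes all
fingerprints fit in memory as one 2D array.

```python
import numpy as np

def calc_ergtc_bulk(query_fp, fp_matrix):
    """Continuous Tanimoto of one query vector against every row of fp_matrix.

    calc_ergtc_bulk is a hypothetical helper; it applies
    a.b / (a.a + b.b - a.b) row by row using array operations.
    """
    query_fp = np.asarray(query_fp, dtype=float)
    fp_matrix = np.asarray(fp_matrix, dtype=float)
    cross = fp_matrix.dot(query_fp)                           # a.b per row
    self_rows = np.einsum('ij,ij->i', fp_matrix, fp_matrix)   # b.b per row
    self_query = query_fp.dot(query_fp)                       # a.a
    denom = self_query + self_rows - cross
    # The denominator is zero only when both vectors are all zeros;
    # report 0.0 in that case instead of dividing by zero.
    scores = np.zeros_like(cross)
    np.divide(cross, denom, out=scores, where=denom > 0)
    return scores

# Made-up vectors standing in for ErG fingerprints.
query = np.array([1.0, 0.3, 0.0])
database = np.array([[1.0, 0.3, 0.0],
                     [0.0, 0.0, 1.0],
                     [1.0, 0.0, 0.0]])
scores = calc_ergtc_bulk(query, database)
order = np.argsort(-scores)   # row indices, most similar first
```

With the pickled list from the first script, the matrix could presumably be
built with np.vstack(ergfps), and order would then index into the lines of
the SMILES file.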
> On 1 Sep 2017, at 09:36, Konrad Koehler <konrad.koeh...@icloud.com> wrote:
>
> Hi,
>
> I am trying to add ErG fingerprints to PostgreSQL using the following post as
> a guide:
>
> https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/Custom%20fingerprint%20in%20PostgreSQL.ipynb
>
> My installation is as follows:
>
> Debian 3.16.43-2
> rdkit 2016.03.4 np111py27_1 rdkit
> rdkit-postgresql 2016.03.4 py27_1 rdkit
>
> In the example linked above, I had to replace the following line with the
> next line (presumably because I am running python 2.7 instead of 3.x):
> m = Chem.Mol(pkl.tobytes())
> m = Chem.Mol(str(pkl))
>
> I then run into the following two problems:
>
>
> First problem: Null characters. When I run the example script (using the
> Sheridan bit vector fingerprints), I generate the following error message:
>
> curs.executemany('insert into fps values
> (%s,bfp_from_binary_text(%s))',[(x,DataStructs.BitVectToBinaryText(y)) for
> x,y in fps])
> ValueError: A string literal cannot contain NUL (0x00) characters.
>
> I am not sure what I should do here. I could strip the null characters from
> the binary text, but are the null characters supposed to be there? Should I
> use the bytea data type on PostgreSQL side?
>
>
> Second problem: convert numpy array into bit vector
>
> The linked example creates a fingerprint as a bit vector:
> fp = Sheridan.GetBTFingerprint(m,fpfn=rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect)
>
> whereas the rdReducedGraphs.GetErGFingerprint method produces a numpy array:
> fp = rdReducedGraphs.GetErGFingerprint(m)
>
> Is there any way to convert this numpy array into a bit vector?
>
>
> Any suggestions would be greatly appreciated. Thanks,
>
> Konrad
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss