Hi everyone,

As a follow-up to my previous post: on reading the Stiefl paper (J. Chem. Inf. 
Model. 2006, 46, 208-220) more closely, I see that my question about converting 
the ErG numpy array into a bit vector was a little naive.  It turns out that the 
ErG array contains floating-point numbers, not integers. One could bin these 
numbers and then convert them into a count vector, but this seems like a lot of work.
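
For anyone who does want to try the binning route, here is a minimal sketch in plain Python. The bin width of 0.1 and the helper name bin_to_counts are my own assumptions for illustration; they do not come from the paper or from RDKit. The input is assumed to be a sequence of non-negative floats, such as the array returned by GetErGFingerprint.

```python
def bin_to_counts(values, bin_width=0.1):
    """Map non-negative floats to integer counts by rounding to the nearest bin.

    bin_width is a hypothetical choice; pick it to match the resolution you need.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    return [int(round(v / bin_width)) for v in values]


# Toy values standing in for a few ErG array entries.
example = [0.0, 0.3, 1.0, 1.3, 2.6]
counts = bin_to_counts(example)
print(counts)
```

The resulting integers could then be used as counts in whatever count-vector representation you prefer. Whether the binned fingerprint ranks compounds the same way as the float version would need to be checked.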

If anyone is interested, I was able to create and store ErG fingerprints for a 
large database (one million compounds), and then run an ErG fingerprint 
similarity search in several minutes without using PostgreSQL (see the example 
scripts below).  Binary fingerprint searches are much faster, but the method 
below is fast enough for my purposes.



Using the following post as a guide:


The first script below creates ErG fingerprints for a SMILES file, and the 
second script sorts the SMILES file by the ErG Tanimoto coefficient 
to a query molecule.

===== CreateErGFingerprints.py =====

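import gzip, cPickle
from rdkit import Chem
from rdkit.Chem import AllChem, rdReducedGraphs

mols = Chem.SmilesMolSupplier("MolSupplier.smi", titleLine=False, smilesColumn=0,
                              nameColumn=1, delimiter=" \t")
ergfps = [rdReducedGraphs.GetErGFingerprint(mol) for mol in mols]

# Pickle the fingerprints to the gzipped file that the search script reads.
with gzip.open("MolSupplier_ergfps.pkl.gz", "wb") as outf:
    cPickle.dump(ergfps, outf, 2)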
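
A side note on the pickling step: cPickle exists only in Python 2. Below is a minimal sketch of a gzip/pickle round trip that should run on both Python 2.7 and 3.x. The helper names save_fps and load_fps, the temporary file name, and the toy data are my own and only for illustration.

```python
import gzip
import os
import tempfile

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3


def save_fps(fps, path):
    """Write a list of fingerprints to a gzipped pickle file."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(fps, handle, 2)  # protocol 2 is readable by both 2.7 and 3.x


def load_fps(path):
    """Read the list of fingerprints back."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Toy lists standing in for ErG arrays.
toy_fps = [[0.0, 0.3, 1.0], [1.3, 0.0, 0.0]]
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_ergfps.pkl.gz")
save_fps(toy_fps, tmp_path)
restored = load_fps(tmp_path)
print(restored == toy_fps)
```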

===== ErGFingerprintSimilaritySearch.py =====

import gzip, cPickle
from rdkit import Chem
from rdkit.Chem import AllChem, rdReducedGraphs
import numpy as np
import os, re, sys
import fileinput

# The ErG FP is a float array, not a bit vector, so use the continuous
# (dot-product) form of the Tanimoto coefficient.
def calc_ergtc( fp1, fp2 ):
    numerator = np.dot(fp1, fp2)
    denominator = np.dot(fp1, fp1) + np.dot(fp2, fp2) - numerator
    return numerator / denominator

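# Read the query (SMILES<TAB>name) from the file named on the command line or
# from stdin; if several lines are given, the last one is used.
for line in fileinput.input():
    (query_smiles, query_name) = line.strip().split("\t")
query_mol   = Chem.MolFromSmiles(query_smiles)
query_ergfp = rdReducedGraphs.GetErGFingerprint(query_mol)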

fp_file = "MolSupplier_ergfps.pkl.gz"

# Load the stored fingerprints; they are in the same order as MolSupplier.smi.
with gzip.open(fp_file, "rb") as inf:
    ergfps = cPickle.load(inf)


results = []
with open("MolSupplier.smi", 'r') as f:
    for index, line in enumerate(f):
        (smiles, name) = line.strip().split("\t")
        ergtc = calc_ergtc(query_ergfp, ergfps[index])
        results.append([smiles, name, ergtc])
sorted_results = sorted(results, key = lambda x: x[2], reverse=True)

print '%s\t%s\t%.4f' % (query_smiles, query_name, 1.0)
for result in sorted_results:
    (smiles, name, ergtc) = result
    print '%s\t%s\t%.4f' % (smiles, name, ergtc)
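
To make the similarity measure concrete: calc_ergtc is the continuous form of the Tanimoto coefficient, Tc = a.b / (a.a + b.b - a.b). The sketch below reimplements it in plain Python on toy vectors, so it runs without RDKit or numpy, and uses heapq.nlargest to keep only the best hits instead of sorting the full list. The names tanimoto_continuous and top_hits and the toy data are hypothetical, and returning 0.0 for two all-zero vectors is my own choice, not something from the paper.

```python
import heapq


def tanimoto_continuous(fp1, fp2):
    """Continuous Tanimoto: a.b / (a.a + b.b - a.b); 0.0 if both are all zeros."""
    ab = sum(x * y for x, y in zip(fp1, fp2))
    aa = sum(x * x for x in fp1)
    bb = sum(y * y for y in fp2)
    denominator = aa + bb - ab
    return ab / denominator if denominator else 0.0


def top_hits(query_fp, named_fps, n=2):
    """Return the n (name, score) pairs with the highest similarity to the query."""
    scored = ((name, tanimoto_continuous(query_fp, fp)) for name, fp in named_fps)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


query = [1.0, 0.0, 2.0]
database = [
    ("same", [1.0, 0.0, 2.0]),
    ("half", [1.0, 0.0, 0.0]),
    ("none", [0.0, 3.0, 0.0]),
]
for name, score in top_hits(query, database):
    print("%s\t%.4f" % (name, score))
```

For a million compounds, keeping only the top few hundred hits this way avoids holding and sorting the full result list, though I have not timed it against the script above.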

> On 1 Sep 2017, at 09:36, Konrad Koehler <konrad.koeh...@icloud.com> wrote:
> Hi,
> I am trying to add ErG fingerprints to PostgreSQL using the following post as 
> a guide:
> https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/Custom%20fingerprint%20in%20PostgreSQL.ipynb
> My installation is as follows:
> Debian 3.16.43-2
> rdkit                     2016.03.4           np111py27_1    rdkit
> rdkit-postgresql          2016.03.4                py27_1    rdkit
> In the example linked above, I had to replace the following line with the 
> next line (presumably because I am running python 2.7 instead of 3.x):
> m = Chem.Mol(pkl.tobytes())
> m = Chem.Mol(str(pkl))
> I then run into the following two problems:
> First problem: Null characters.  When I run the example script (using the 
> Sheridan bit vector fingerprints), I generate the following error message:
> curs.executemany('insert into fps values 
> (%s,bfp_from_binary_text(%s))',[(x,DataStructs.BitVectToBinaryText(y)) for 
> x,y in fps])
> ValueError: A string literal cannot contain NUL (0x00) characters.
> I am not sure what I should do here.  I could strip the null characters from 
> the binary text, but are the null characters supposed to be there? Should I 
> use the bytea data type on PostgreSQL side?
> Second problem: convert numpy array into bit vector
> The linked example creates a fingerprint as a bit vector:
> fp = 
> Sheridan.GetBTFingerprint(m,fpfn=rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect)
> whereas rdReducedGraphs.GetErGFingerprint method produces a numpy array:
> fp = rdReducedGraphs.GetErGFingerprint(m)
> Is there any way of converting this numpy array into a bit vector?
> Any suggestions would be greatly appreciated.  Thanks,
> Konrad

Rdkit-discuss mailing list
