On 09/02/2017 11:05 PM, Konrad Koehler wrote:
> Hi everyone,
>
> As a follow-up to my previous post, in reading the Stiefl paper (J. Chem.
> Inf. Model. 2006, 46, 208-220) more closely, I see my question about
> converting the ErG numpy array into a bit vector was a little naive. It
> turns out that the ErG array contains floating point numbers, not integers.
> One could bin these numbers and then convert them into a count vector, but
> this seems like a lot of work.
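The binning step itself need not be much work. Here is a minimal sketch, assuming non-negative values and a fixed bin width. The 0.1 is an arbitrary choice for illustration, not something taken from the ErG paper, and the function names are made up. It works on plain Python lists, so the numpy array from RDKit would need converting first (e.g. with list()):

```python
def to_count_vector(values, bin_width=0.1):
    """Quantise each float to the nearest whole number of bins.

    bin_width is a hypothetical parameter; pick it to suit your data.
    """
    return [int(round(v / bin_width)) for v in values]


def count_tanimoto(a, b):
    """Min/max Tanimoto for two equal-length, non-negative count vectors."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero vectors are treated as identical (a convention chosen here).
    return 1.0 if denominator == 0 else float(numerator) / denominator


# Toy values only, not real ErG output.
counts_a = to_count_vector([0.0, 0.3, 1.0, 1.3])
counts_b = to_count_vector([0.3, 0.0, 1.0, 1.0])
similarity = count_tanimoto(counts_a, counts_b)
```

A smaller bin width keeps more of the original resolution at the cost of larger counts.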
There are better things than histograms out there; this one, for example:

@article{Wand1994,
  title = {Fast {Computation} of {Multivariate} {Kernel} {Estimators}},
  volume = {3},
  issn = {1061-8600},
  url = {http://amstat.tandfonline.com/doi/abs/10.1080/10618600.1994.10474656},
  doi = {10.1080/10618600.1994.10474656},
  number = {4},
  journal = {Journal of Computational and Graphical Statistics},
  author = {Wand, M. P.},
  month = dec,
  year = {1994},
  pages = {433--445}
}

"Linear binning" they call it.

> If anyone is interested, I was able to create and store ErG fingerprints of
> a large database (one million compounds), and then do an ErG fingerprint
> similarity search in several minutes without using PostgreSQL (see
> example scripts below). Binary fingerprint searches are much faster,
> but the method below is fast enough for my purposes.
>
> Cheers,
>
> Konrad
>
> Using the following post as a guide:
>
> https://iwatobipen.wordpress.com/2016/01/16/ergfingerprint-in-rdkit/
>
> The first script below will create ErG fingerprints for a SMILES file
> and the second script will sort the SMILES file based on the ErG
> Tanimoto coefficient to a query molecule.
>
> ===== CreateErGFingerprints.py =====
>
> import gzip, cPickle
> from rdkit import Chem
> from rdkit.Chem import AllChem, rdReducedGraphs
>
> mols = Chem.SmilesMolSupplier("MolSupplier.smi", titleLine=False,
>                               smilesColumn=0, nameColumn=1,
>                               delimiter=" \t")
> ergfps = [rdReducedGraphs.GetErGFingerprint(mol) for mol in mols]
> fp = gzip.open('MolSupplier_ergfps.pkl.gz', 'wb')
> cPickle.dump(ergfps, fp)
> fp.close()
>
> ===== ErGFingerprintSimilaritySearch.py =====
>
> import gzip, cPickle
> from rdkit import Chem
> from rdkit.Chem import AllChem, rdReducedGraphs
> import numpy as np
> import os, re, sys
> import fileinput
>
> # ErG FP is not bit vect.
> def calc_ergtc(fp1, fp2):
>     # Parentheses keep the wrapped subtraction part of the same expression.
>     denominator = (np.sum(np.dot(fp1, fp1)) + np.sum(np.dot(fp2, fp2))
>                    - np.sum(np.dot(fp1, fp2)))
>     numerator = np.sum(np.dot(fp1, fp2))
>     return numerator / denominator
>
> for line in fileinput.input():
>     (query_smiles, query_name) = line.strip().split("\t")
>
> query_mol = Chem.MolFromSmiles(query_smiles)
> query_ergfp = rdReducedGraphs.GetErGFingerprint(query_mol)
>
> fp_file = "MolSupplier_ergfps.pkl.gz"
>
> mols = Chem.SmilesMolSupplier("MolSupplier.smi", titleLine=False,
>                               smilesColumn=0, nameColumn=1,
>                               delimiter=" \t")
>
> fp = gzip.open(fp_file, 'rb')
> ergfps = cPickle.load(fp)
> fp.close()
>
> results = []
> f = open("MolSupplier.smi", 'r')
> for index, line in enumerate(f.readlines()):
>     (smiles, name) = line.strip().split("\t")
>     ergtc = calc_ergtc(query_ergfp, ergfps[index])
>     results.append([smiles, name, ergtc])
> sorted_results = sorted(results, key=lambda x: x[2], reverse=True)
>
> print '%s\t%s\t%.4f' % (query_smiles, query_name, 1.0)
> for result in sorted_results:
>     (smiles, name, ergtc) = result
>     print '%s\t%s\t%.4f' % (smiles, name, ergtc)
>
>> On 1 Sep 2017, at 09:36, Konrad Koehler <konrad.koeh...@icloud.com
>> <mailto:konrad.koeh...@icloud.com>> wrote:
>>
>> Hi,
>>
>> I am trying to add ErG fingerprints to PostgreSQL using the following
>> post as a guide:
>>
>> https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/Custom%20fingerprint%20in%20PostgreSQL.ipynb
>>
>> My installation is as follows:
>>
>> Debian 3.16.43-2
>> rdkit 2016.03.4 np111py27_1 rdkit
>> rdkit-postgresql 2016.03.4 py27_1 rdkit
>>
>> In the example linked above, I had to replace the first of the following
>> lines with the second (presumably because I am running Python 2.7
>> instead of 3.x):
>> m = Chem.Mol(pkl.tobytes())
>> m = Chem.Mol(str(pkl))
>>
>> I then run into the following two problems:
>>
>> *First problem*: Null characters.
>> When I run the example script (using the Sheridan bit vector
>> fingerprints), I get the following error message:
>>
>> curs.executemany('insert into fps values
>> (%s,bfp_from_binary_text(%s))',[(x,DataStructs.BitVectToBinaryText(y))
>> for x,y in fps])
>> ValueError: A string literal cannot contain NUL (0x00) characters.
>>
>> I am not sure what I should do here. I could strip the null characters
>> from the binary text, but are the null characters supposed to be
>> there? Should I use the bytea data type on the PostgreSQL side?
>>
>> *Second problem*: Converting a numpy array into a bit vector.
>>
>> The linked example creates a fingerprint as a bit vector:
>> fp = Sheridan.GetBTFingerprint(m,fpfn=rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect)
>>
>> whereas the rdReducedGraphs.GetErGFingerprint method produces a numpy array:
>> fp = rdReducedGraphs.GetErGFingerprint(m)
>>
>> Is there any way of converting this numpy array into a bit vector?
>>
>> Any suggestions would be greatly appreciated. Thanks,
>>
>> Konrad

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss