[Rdkit-discuss] Matching Generalized Compounds
Hi All, I’m interested in having GetSubstructMatches return non-“null” results in the following example. The results should lead to a match where atom 1 maps to atom 11, 2 to 12, etc. m1 = Chem.MolFromSmiles('[*:1][CH2:2][C:3]([CH3:4])=[CH2:5]') m2 = Chem.MolFromSmiles('[F:11][CH2:12][C:13]([*:14])=[CH2:15]') ### do something here so that the mols will match ### qp = Chem.AdjustQueryParameters() qp.makeDummiesQueries = True m1 = Chem.AdjustQueryProperties(m1, qp) m2 = Chem.AdjustQueryProperties(m2, qp) # I’d like both of the following to return results m1.GetSubstructMatches(m2) m2.GetSubstructMatches(m1) My understanding of why these mols currently do not match is as follows: because only the dummy atoms are made queries (based on my query parameter adjustment), when one mol is matched to another dummy 1 may match to F:11, but dummy 14 will then not match to methyl:14. This is because (as I understand), normal atoms can only be matched by queries, and cannot match them themselves. Potential ideas to make this work as I’d like: 1. Override atom.Match in the python code – not sure that this would work since the C++ version of this function is what would be called during GetSubstructMatches 2. Override atom.Match in the C++ code – not quite sure how to do this, or what side affects it might have. Ideally the changes I make would only affect this example (and other similar ones) 3. Make all atoms in both molecules QueryAtoms, but otherwise leave them unchanged. I’m not quite sure how to do this! Does anyone have any ideas for what the best approach here would be, or knows if there is already built in functionality for something like this? I’d prefer to not use SMARTS to construct my molecules if possible, since I don’t really think of them as queries, just as other molecules in the system that happen to not be fully specified. - Kovas -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance
Hi Shojiro, This might not be the most elegant and efficient way but it worked for what I wanted to do. I changed my apply function as below: allDescp=[name[0] for name in Descriptors._descList] for name in allDescp: temp=MoleculeDescriptors.MolecularDescriptorCalculator([name]) df[name]=df['SMILES'].apply(lambda x: temp.CalcDescriptors(Chem.MolFromSmiles(x))[0]) y=df['Target].values X=np.array(df.drop('Target')) features_list = X.columns.values[0::] model= RandomForestRegressor(n_estimators = 1000, random_state = 42) model.fit(X,y) feature_importance = model.feature_importances_ threshold = 5 important_index = np.where(feature_importance > threshold)[0] important_features = features_list[important_index] Ali On Tue, Aug 21, 2018 at 6:51 AM Ali Eftekhari wrote: > hi Shojiro, > > Thanks for your response but print > (np.argsort(rfregress.feature_importances_)[::-1]) returns the row indices > but what I want is the column names so it can give me information which > features are important. > > On Mon, Aug 20, 2018 at 9:31 PM Shojiro Shibayama > wrote: > >> Dear Ali, >> >> Please run first the following code, which may help you: >> >> ```python >> import numpy as np >> np.argsort(rfregress.feature_importances_)[::-1] >> ``` >> >> The `argsort` will return the indexes of the important features in >> ascending order and [::-1] reverses the order. >> The indexes for feature importance must correspond to the order of >> variables (or the order in 'allDescp' of your code), so use these >> variables, you'll get the information that you want. >> >> Sincerely yours, >> Shojiro >> >> >> On Tue, 21 Aug 2018 at 10:34, Ali Eftekhari >> wrote: >> >>> Hello rdkit, >>> >>> This might be trivial but I am beginner and don't know how to do it. >>> >>> I am building a simple model to predict target property. I have pandas >>> dataframe (df) whose columns are 'SMILES' and 'Target'. >>> >>> #calculating the descriptors as below: >>> llDescp=[name[0] for name in Descriptors._descList] >>> calc=MoleculeDescriptors.MolecularDescriptorCalculator(allDescp) >>> df ['fp']=df['SMILES'].apply(lambda x: >>> calc.CalcDescriptors(Chem.MolFromSmiles(x))) >>> >>> #converting the fingerprint to numpy array >>> y=df['Target'].values >>> X=np.array(list(df['fp'])) >>> >>> #preprocessing >>> X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.25, >>> random_state=42) >>> st=StandardScaler() >>> X=st.fit_transform(X) >>> >>> #random forest model >>> model=RandomForestRegressor(n_estimators=10) >>> model.fit(X_train, y_train) >>> >>> My problem is that I don't know how to get the meaningful >>> feature_importance. The following will return the values of descriptors >>> but there is no labels and so I don't know how to figure out which features >>> are important. >>> >>> print (sorted (rfregress.feature_importances_)) >>> >>> Thanks for your help! >>> >>> >>> >>> >>> -- >>> Check out the vibrant tech community on one of the world's most >>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot >>> ___ >>> Rdkit-discuss mailing list >>> Rdkit-discuss@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >>> >> >> >> -- >> >> The University of Tokyo >> 2nd year Ph.D. candidate >> Shojiro Shibayama >> >> > -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Alignment using LIgpargen file
Dear Phuong, it works for me with the latest RDKit release. I am wondering if you are using an older RDKit version where maybe the PDB parser was not assigning formal charges based on connectivity? Then you would have to manually set the formal charge of the protonated nitrogen atom (# 2 N) to +1. Cheers, p. On 08/22/18 19:06, Phuong Chau wrote: Hi Paolo, I got this new error when I tried with different chemicals: python 3DAlignmentwith3OAwithGRO.py CCNC\(N\)N_afterMD.pdb CC\[NH2\+\]CCN_Ligpargen.pdb [12:02:45] Explicit valence for atom # 2 N, 4, is greater than permitted Traceback (most recent call last): File "3DAlignmentwith3OAwithGRO.py", line 12, in rdDistGeom.EmbedMolecule(prbMolwithH) Boost.Python.ArgumentError: Python argument types in rdkit.Chem.rdDistGeom.EmbedMolecule(NoneType) did not match C++ signature: EmbedMolecule(RDKit::ROMol {lvalue} mol, unsigned int maxAttempts=0, int randomSeed=-1, bool clearConfs=True, bool useRandomCoords=False, double boxSizeMult=2.0, bool randNegEig=True, unsigned int numZeroFail=1, boost::python::dict {lvalue} coordMap={}, double forceTol=0.001, bool ignoreSmoothingFailures=False) The second argument I also downloaded it from Ligpargen but it gave me this error. I am not so sure why it happened. I also attached the two files : On Tue, Aug 21, 2018 at 3:19 PM, Paolo Tosco mailto:paolo.tosco.m...@gmail.com>> wrote: Hi Phuong, it does have hydrogens after the alignment: $ ls UNK_A1C198.pdb align.py drug.pdb $ python align.py drug.pdb UNK_A1C198.pdb $ ls UNK.pdb UNK_A1C198.pdb align.py drug.pdb $ cat UNK.pdb ATOM 1 C00 UNK 1 37.884 56.016 51.678 1.00 0.00 C ATOM 2 C01 UNK 1 37.501 54.554 51.968 1.00 0.00 C ATOM 3 C02 UNK 1 36.085 54.316 52.531 1.00 0.00 C ATOM 4 C03 UNK 1 35.697 55.221 53.715 1.00 0.00 C ATOM 5 C04 UNK 1 34.907 56.485 53.304 1.00 0.00 C ATOM 6 C05 UNK 1 35.788 57.733 53.108 1.00 0.00 C ATOM 7 C06 UNK 1 36.227 57.998 51.654 1.00 0.00 C ATOM 8 C07 UNK 1 36.813 56.799 50.889 1.00 0.00 C ATOM 9 H08 UNK 1 38.821 56.001 51.080 1.00 0.00 H ATOM 10 H09 UNK 1 38.160 56.522 52.625 1.00 0.00 H ATOM 11 H0A UNK 1 37.598 53.969 51.027 1.00 0.00 H ATOM 12 H0B UNK 1 38.243 54.138 52.684 1.00 0.00 H ATOM 13 H0C UNK 1 35.325 54.368 51.723 1.00 0.00 H ATOM 14 H0D UNK 1 36.057 53.262 52.884 1.00 0.00 H ATOM 15 H0E UNK 1 35.031 54.622 54.375 1.00 0.00 H ATOM 16 H0F UNK 1 36.586 55.471 54.333 1.00 0.00 H ATOM 17 H0G UNK 1 34.203 56.712 54.134 1.00 0.00 H ATOM 18 H0H UNK 1 34.270 56.304 52.411 1.00 0.00 H ATOM 19 H0I UNK 1 35.198 58.620 53.428 1.00 0.00 H ATOM 20 H0J UNK 1 36.665 57.714 53.788 1.00 0.00 H ATOM 21 H0K UNK 1 36.982 58.814 51.667 1.00 0.00 H ATOM 22 H0M UNK 1 35.352 58.378 51.082 1.00 0.00 H ATOM 23 H0N UNK 1 35.991 56.144 50.537 1.00 0.00 H ATOM 24 H0O UNK 1 37.280 57.194 49.960 1.00 0.00 H CONECT 1 2 8 9 10 CONECT 2 3 11 12 CONECT 3 4 13 14 CONECT 4 5 15 16 CONECT 5 6 17 18 CONECT 6 7 19 20 CONECT 7 8 21 22 CONECT 8 23 24 END p. On 08/21/18 20:46, Phuong Chau wrote: Hi Paolo, I tried the code, it worked with the alignment but the new chemical after alignment still does not have any H atoms. If I add H after alignment, it has weird structure with H atoms. Would you please show me how to fix this problem? Best, Phuong Chau On Tue, Aug 21, 2018 at 8:01 AM, Paolo Tosco mailto:paolo.tosco.m...@gmail.com>> wrote: Hi Phuong, If you wish to retain Hs you just need to set removeHs = False when you call MolFromPDBFile(): # align.py import sys from rdkit import Chem from rdkit.Chem import rdDistGeom, rdMolAlign, rdForceFieldHelpers from rdkit import Chem refMolwithH = Chem.MolFromPDBFile(sys.argv[1], removeHs = False) s = sys.argv[2] prbMolwithH = Chem.MolFromPDBFile(s, removeHs = False) idx=s.find('_') chemB= s[:idx] rdDistGeom.EmbedMolecule(prbMolwithH)
Re: [Rdkit-discuss] Alignment using LIgpargen file
Hi Paolo, I got this new error when I tried with different chemicals: python 3DAlignmentwith3OAwithGRO.py CCNC\(N\)N_afterMD.pdb CC\[NH2\+\]CCN_Ligpargen.pdb [12:02:45] Explicit valence for atom # 2 N, 4, is greater than permitted Traceback (most recent call last): File "3DAlignmentwith3OAwithGRO.py", line 12, in rdDistGeom.EmbedMolecule(prbMolwithH) Boost.Python.ArgumentError: Python argument types in rdkit.Chem.rdDistGeom.EmbedMolecule(NoneType) did not match C++ signature: EmbedMolecule(RDKit::ROMol {lvalue} mol, unsigned int maxAttempts=0, int randomSeed=-1, bool clearConfs=True, bool useRandomCoords=False, double boxSizeMult=2.0, bool randNegEig=True, unsigned int numZeroFail=1, boost::python::dict {lvalue} coordMap={}, double forceTol=0.001, bool ignoreSmoothingFailures=False) The second argument I also downloaded it from Ligpargen but it gave me this error. I am not so sure why it happened. I also attached the two files : On Tue, Aug 21, 2018 at 3:19 PM, Paolo Tosco wrote: > Hi Phuong, > > it does have hydrogens after the alignment: > > $ ls > UNK_A1C198.pdb align.py drug.pdb > > $ python align.py drug.pdb UNK_A1C198.pdb > > $ ls > UNK.pdb UNK_A1C198.pdb align.py drug.pdb > $ cat UNK.pdb > ATOM 1 C00 UNK 1 37.884 56.016 51.678 1.00 > 0.00 C > ATOM 2 C01 UNK 1 37.501 54.554 51.968 1.00 > 0.00 C > ATOM 3 C02 UNK 1 36.085 54.316 52.531 1.00 > 0.00 C > ATOM 4 C03 UNK 1 35.697 55.221 53.715 1.00 > 0.00 C > ATOM 5 C04 UNK 1 34.907 56.485 53.304 1.00 > 0.00 C > ATOM 6 C05 UNK 1 35.788 57.733 53.108 1.00 > 0.00 C > ATOM 7 C06 UNK 1 36.227 57.998 51.654 1.00 > 0.00 C > ATOM 8 C07 UNK 1 36.813 56.799 50.889 1.00 > 0.00 C > ATOM 9 H08 UNK 1 38.821 56.001 51.080 1.00 > 0.00 H > ATOM 10 H09 UNK 1 38.160 56.522 52.625 1.00 > 0.00 H > ATOM 11 H0A UNK 1 37.598 53.969 51.027 1.00 > 0.00 H > ATOM 12 H0B UNK 1 38.243 54.138 52.684 1.00 > 0.00 H > ATOM 13 H0C UNK 1 35.325 54.368 51.723 1.00 > 0.00 H > ATOM 14 H0D UNK 1 36.057 53.262 52.884 1.00 > 0.00 H > ATOM 15 H0E UNK 1 35.031 54.622 54.375 1.00 > 0.00 H > ATOM 16 H0F UNK 1 36.586 55.471 54.333 1.00 > 0.00 H > ATOM 17 H0G UNK 1 34.203 56.712 54.134 1.00 > 0.00 H > ATOM 18 H0H UNK 1 34.270 56.304 52.411 1.00 > 0.00 H > ATOM 19 H0I UNK 1 35.198 58.620 53.428 1.00 > 0.00 H > ATOM 20 H0J UNK 1 36.665 57.714 53.788 1.00 > 0.00 H > ATOM 21 H0K UNK 1 36.982 58.814 51.667 1.00 > 0.00 H > ATOM 22 H0M UNK 1 35.352 58.378 51.082 1.00 > 0.00 H > ATOM 23 H0N UNK 1 35.991 56.144 50.537 1.00 > 0.00 H > ATOM 24 H0O UNK 1 37.280 57.194 49.960 1.00 > 0.00 H > CONECT1289 10 > CONECT23 11 12 > CONECT34 13 14 > CONECT45 15 16 > CONECT56 17 18 > CONECT67 19 20 > CONECT78 21 22 > CONECT8 23 24 > END > > p. > > > On 08/21/18 20:46, Phuong Chau wrote: > > Hi Paolo, > > I tried the code, it worked with the alignment but the new chemical after > alignment still does not have any H atoms. If I add H after alignment, it > has weird structure with H atoms. Would you please show me how to fix this > problem? > > Best, > Phuong Chau > > On Tue, Aug 21, 2018 at 8:01 AM, Paolo Tosco > wrote: > >> Hi Phuong, >> >> If you wish to retain Hs you just need to set removeHs = False when you >> call MolFromPDBFile(): >> >> # align.py >> >> import sys >> from rdkit import Chem >> from rdkit.Chem import rdDistGeom, rdMolAlign, rdForceFieldHelpers >> from rdkit import Chem >> >> refMolwithH = Chem.MolFromPDBFile(sys.argv[1], removeHs = False) >> s = sys.argv[2] >> prbMolwithH = Chem.MolFromPDBFile(s, removeHs = False) >> idx=s.find('_') >> chemB= s[:idx] >> >> rdDistGeom.EmbedMolecule(prbMolwithH) >> rdForceFieldHelpers.UFFOptimizeMolecule(prbMolwithH) >> >> ##Alignment >> pyO3A = rdMolAlign.GetO3A(prbMolwithH, refMolwithH) >> score = pyO3A.Align() >> >> ##3D coords of Chem B after alignmnet >> Chem.MolToPDBFile(prbMolwithH,'{}.pdb'.format(chemB)) >> >> and then >> >> ./align.py drug.pdb UNK_A1C198.pdb >> Cheers, >> p. >> >> On 08/20/18 19:02, Phuong Chau wrote: >> >> Hello everyone, >> >> I am trying to align two chemicals using their pdb files with the >> following script: >> >> *refMolwithH = Chem.MolFromPDBFile(sys.argv[1])* >> *s = sys.argv[2]* >> *prbMolwithH = Chem.MolFromPDBFile(s)* >> *idx=s.find('_')* >> *chemB= s[:idx]* >> >>