[Rdkit-discuss] Matching Generalized Compounds

2018-08-22 Thread Kovas Palunas
Hi All,

I’m interested in having GetSubstructMatches return non-“null” results in the 
following example.  The results should lead to a match where atom 1 maps to 
atom 11, 2 to 12, etc.

m1 = Chem.MolFromSmiles('[*:1][CH2:2][C:3]([CH3:4])=[CH2:5]')
m2 = Chem.MolFromSmiles('[F:11][CH2:12][C:13]([*:14])=[CH2:15]')

### do something here so that the mols will match ###
qp = Chem.AdjustQueryParameters()
qp.makeDummiesQueries = True
m1 = Chem.AdjustQueryProperties(m1, qp)
m2 = Chem.AdjustQueryProperties(m2, qp)

# I’d like both of the following to return results
m1.GetSubstructMatches(m2)
m2.GetSubstructMatches(m1)

My understanding of why these mols currently do not match is as follows:  
because only the dummy atoms are made queries (based on my query parameter 
adjustment), when one mol is matched to another dummy 1 may match to F:11, but 
dummy 14 will then not match to methyl:14.  This is because (as I understand), 
normal atoms can only be matched by queries, and cannot match them themselves.

Potential ideas to make this work as I’d like:

  1.  Override atom.Match in the python code – not sure that this would work 
since the C++ version of this function is what would be called during 
GetSubstructMatches
  2.  Override atom.Match in the C++ code – not quite sure how to do this, or 
what side affects it might have.  Ideally the changes I make would only affect 
this example (and other similar ones)
  3.  Make all atoms in both molecules QueryAtoms, but otherwise leave them 
unchanged.  I’m not quite sure how to do this!

Does anyone have any ideas for what the best approach here would be, or knows 
if there is already built in functionality for something like this?  I’d prefer 
to not use SMARTS to construct my molecules if possible, since I don’t really 
think of them as queries, just as other molecules in the system that happen to 
not be fully specified.

- Kovas

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] back tracking descriptor names from RandomForest feature_importance

2018-08-22 Thread Ali Eftekhari
Hi Shojiro,

This might not be the most elegant and efficient way but it worked for what
I wanted to do.  I changed my apply function as below:
allDescp=[name[0] for name in Descriptors._descList]
for name in allDescp:
temp=MoleculeDescriptors.MolecularDescriptorCalculator([name])
df[name]=df['SMILES'].apply(lambda x:
temp.CalcDescriptors(Chem.MolFromSmiles(x))[0])

y=df['Target].values
X=np.array(df.drop('Target'))

features_list = X.columns.values[0::]

model= RandomForestRegressor(n_estimators = 1000, random_state = 42)
model.fit(X,y)
feature_importance = model.feature_importances_

threshold = 5
important_index = np.where(feature_importance > threshold)[0]
important_features = features_list[important_index]

Ali

On Tue, Aug 21, 2018 at 6:51 AM Ali Eftekhari 
wrote:

> hi Shojiro,
>
> Thanks for your response but print
> (np.argsort(rfregress.feature_importances_)[::-1]) returns the row indices
> but what I want is the column names so it can give me information which
> features are important.
>
> On Mon, Aug 20, 2018 at 9:31 PM Shojiro Shibayama 
> wrote:
>
>> Dear Ali,
>>
>> Please run first the following code, which may help you:
>>
>> ```python
>> import numpy as np
>> np.argsort(rfregress.feature_importances_)[::-1]
>> ```
>>
>> The `argsort` will return the indexes of the important features in
>> ascending order and [::-1] reverses the order.
>> The indexes for feature importance must correspond to the order of
>> variables (or the order in 'allDescp' of your code), so use these
>> variables, you'll get the information that you want.
>>
>> Sincerely yours,
>> Shojiro
>>
>>
>> On Tue, 21 Aug 2018 at 10:34, Ali Eftekhari 
>> wrote:
>>
>>> Hello rdkit,
>>>
>>> This might be trivial but I am beginner and don't know how to do it.
>>>
>>> I am building a simple model to predict target property.  I have pandas
>>> dataframe (df) whose columns are 'SMILES' and 'Target'.
>>>
>>> #calculating the descriptors as below:
>>> llDescp=[name[0] for name in Descriptors._descList]
>>> calc=MoleculeDescriptors.MolecularDescriptorCalculator(allDescp)
>>> df ['fp']=df['SMILES'].apply(lambda x:
>>> calc.CalcDescriptors(Chem.MolFromSmiles(x)))
>>>
>>> #converting  the fingerprint to numpy array
>>> y=df['Target'].values
>>> X=np.array(list(df['fp']))
>>>
>>> #preprocessing
>>> X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.25,
>>> random_state=42)
>>> st=StandardScaler()
>>> X=st.fit_transform(X)
>>>
>>> #random forest model
>>> model=RandomForestRegressor(n_estimators=10)
>>> model.fit(X_train, y_train)
>>>
>>> My problem is that I don't know how to get the meaningful
>>> feature_importance.  The following will return the values of descriptors
>>> but there is no labels and so I don't know how to figure out which features
>>> are important.
>>>
>>> print (sorted (rfregress.feature_importances_))
>>>
>>> Thanks for your help!
>>>
>>>
>>>
>>>
>>> --
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>>
>> --
>> 
>> The University of Tokyo
>> 2nd year Ph.D. candidate
>>   Shojiro Shibayama
>> 
>>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Alignment using LIgpargen file

2018-08-22 Thread Paolo Tosco

Dear Phuong,

it works for me with the latest RDKit release. I am wondering if you are 
using an older RDKit version where maybe the PDB parser was not 
assigning formal charges based on connectivity? Then you would have to 
manually set the formal charge of the protonated nitrogen atom (# 2 N) 
to +1.


Cheers,
p.


On 08/22/18 19:06, Phuong Chau wrote:

Hi Paolo,

I got this new error when I tried with different chemicals:

python 3DAlignmentwith3OAwithGRO.py CCNC\(N\)N_afterMD.pdb 
CC\[NH2\+\]CCN_Ligpargen.pdb

[12:02:45] Explicit valence for atom # 2 N, 4, is greater than permitted
Traceback (most recent call last):
  File "3DAlignmentwith3OAwithGRO.py", line 12, in 
    rdDistGeom.EmbedMolecule(prbMolwithH)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdDistGeom.EmbedMolecule(NoneType)
did not match C++ signature:
    EmbedMolecule(RDKit::ROMol {lvalue} mol, unsigned int 
maxAttempts=0, int randomSeed=-1, bool clearConfs=True, bool 
useRandomCoords=False, double boxSizeMult=2.0, bool randNegEig=True, 
unsigned int numZeroFail=1, boost::python::dict {lvalue} coordMap={}, 
double forceTol=0.001, bool ignoreSmoothingFailures=False)


The second argument I also downloaded it from Ligpargen but it gave me 
this error. I am not so sure why it happened.


I also attached the two files :

On Tue, Aug 21, 2018 at 3:19 PM, Paolo Tosco 
mailto:paolo.tosco.m...@gmail.com>> wrote:


Hi Phuong,

it does have hydrogens after the alignment:

$ ls
UNK_A1C198.pdb  align.py  drug.pdb

$ python align.py drug.pdb UNK_A1C198.pdb

$ ls
UNK.pdb UNK_A1C198.pdb  align.py  drug.pdb

$ cat UNK.pdb
ATOM  1  C00 UNK 1  37.884  56.016 51.678  1.00 
0.00   C
ATOM  2  C01 UNK 1  37.501  54.554 51.968  1.00 
0.00   C
ATOM  3  C02 UNK 1  36.085  54.316 52.531  1.00 
0.00   C
ATOM  4  C03 UNK 1  35.697  55.221 53.715  1.00 
0.00   C
ATOM  5  C04 UNK 1  34.907  56.485 53.304  1.00 
0.00   C
ATOM  6  C05 UNK 1  35.788  57.733 53.108  1.00 
0.00   C
ATOM  7  C06 UNK 1  36.227  57.998 51.654  1.00 
0.00   C
ATOM  8  C07 UNK 1  36.813  56.799 50.889  1.00 
0.00   C
ATOM  9  H08 UNK 1  38.821  56.001 51.080  1.00 
0.00   H
ATOM 10  H09 UNK 1  38.160  56.522 52.625  1.00 
0.00   H
ATOM 11  H0A UNK 1  37.598  53.969 51.027  1.00 
0.00   H
ATOM 12  H0B UNK 1  38.243  54.138 52.684  1.00 
0.00   H
ATOM 13  H0C UNK 1  35.325  54.368 51.723  1.00 
0.00   H
ATOM 14  H0D UNK 1  36.057  53.262 52.884  1.00 
0.00   H
ATOM 15  H0E UNK 1  35.031  54.622 54.375  1.00 
0.00   H
ATOM 16  H0F UNK 1  36.586  55.471 54.333  1.00 
0.00   H
ATOM 17  H0G UNK 1  34.203  56.712 54.134  1.00 
0.00   H
ATOM 18  H0H UNK 1  34.270  56.304 52.411  1.00 
0.00   H
ATOM 19  H0I UNK 1  35.198  58.620 53.428  1.00 
0.00   H
ATOM 20  H0J UNK 1  36.665  57.714 53.788  1.00 
0.00   H
ATOM 21  H0K UNK 1  36.982  58.814 51.667  1.00 
0.00   H
ATOM 22  H0M UNK 1  35.352  58.378 51.082  1.00 
0.00   H
ATOM 23  H0N UNK 1  35.991  56.144 50.537  1.00 
0.00   H
ATOM 24  H0O UNK 1  37.280  57.194 49.960  1.00 
0.00   H
CONECT    1    2    8    9   10
CONECT    2    3   11   12
CONECT    3    4   13   14
CONECT    4    5   15   16
CONECT    5    6   17   18
CONECT    6    7   19   20
CONECT    7    8   21   22
CONECT    8   23   24
END

p.


On 08/21/18 20:46, Phuong Chau wrote:

Hi Paolo,

I tried the code, it worked with the alignment but the new
chemical after alignment still does not have any H atoms. If I
add H after alignment, it has weird structure with H atoms. Would
you please show me how to fix this problem?

Best,
Phuong Chau

On Tue, Aug 21, 2018 at 8:01 AM, Paolo Tosco
mailto:paolo.tosco.m...@gmail.com>>
wrote:

Hi Phuong,

If you wish to retain Hs you just need to set removeHs =
False when you call MolFromPDBFile():

# align.py

import sys
from rdkit import Chem
from rdkit.Chem import rdDistGeom, rdMolAlign,
rdForceFieldHelpers
from rdkit import Chem

refMolwithH = Chem.MolFromPDBFile(sys.argv[1], removeHs = False)
s = sys.argv[2]
prbMolwithH = Chem.MolFromPDBFile(s, removeHs = False)
idx=s.find('_')
chemB= s[:idx]

rdDistGeom.EmbedMolecule(prbMolwithH)
 

Re: [Rdkit-discuss] Alignment using LIgpargen file

2018-08-22 Thread Phuong Chau
Hi Paolo,

I got this new error when I tried with different chemicals:

python 3DAlignmentwith3OAwithGRO.py CCNC\(N\)N_afterMD.pdb
CC\[NH2\+\]CCN_Ligpargen.pdb
[12:02:45] Explicit valence for atom # 2 N, 4, is greater than permitted
Traceback (most recent call last):
  File "3DAlignmentwith3OAwithGRO.py", line 12, in 
rdDistGeom.EmbedMolecule(prbMolwithH)
Boost.Python.ArgumentError: Python argument types in
rdkit.Chem.rdDistGeom.EmbedMolecule(NoneType)
did not match C++ signature:
EmbedMolecule(RDKit::ROMol {lvalue} mol, unsigned int maxAttempts=0,
int randomSeed=-1, bool clearConfs=True, bool useRandomCoords=False, double
boxSizeMult=2.0, bool randNegEig=True, unsigned int numZeroFail=1,
boost::python::dict {lvalue} coordMap={}, double forceTol=0.001, bool
ignoreSmoothingFailures=False)

The second argument I also downloaded it from Ligpargen but it gave me this
error. I am not so sure why it happened.

I also attached the two files :

On Tue, Aug 21, 2018 at 3:19 PM, Paolo Tosco 
wrote:

> Hi Phuong,
>
> it does have hydrogens after the alignment:
>
> $ ls
> UNK_A1C198.pdb  align.py  drug.pdb
>
> $ python align.py drug.pdb UNK_A1C198.pdb
>
> $ ls
> UNK.pdb  UNK_A1C198.pdb  align.py  drug.pdb
> $ cat UNK.pdb
> ATOM  1  C00 UNK 1  37.884  56.016  51.678  1.00
> 0.00   C
> ATOM  2  C01 UNK 1  37.501  54.554  51.968  1.00
> 0.00   C
> ATOM  3  C02 UNK 1  36.085  54.316  52.531  1.00
> 0.00   C
> ATOM  4  C03 UNK 1  35.697  55.221  53.715  1.00
> 0.00   C
> ATOM  5  C04 UNK 1  34.907  56.485  53.304  1.00
> 0.00   C
> ATOM  6  C05 UNK 1  35.788  57.733  53.108  1.00
> 0.00   C
> ATOM  7  C06 UNK 1  36.227  57.998  51.654  1.00
> 0.00   C
> ATOM  8  C07 UNK 1  36.813  56.799  50.889  1.00
> 0.00   C
> ATOM  9  H08 UNK 1  38.821  56.001  51.080  1.00
> 0.00   H
> ATOM 10  H09 UNK 1  38.160  56.522  52.625  1.00
> 0.00   H
> ATOM 11  H0A UNK 1  37.598  53.969  51.027  1.00
> 0.00   H
> ATOM 12  H0B UNK 1  38.243  54.138  52.684  1.00
> 0.00   H
> ATOM 13  H0C UNK 1  35.325  54.368  51.723  1.00
> 0.00   H
> ATOM 14  H0D UNK 1  36.057  53.262  52.884  1.00
> 0.00   H
> ATOM 15  H0E UNK 1  35.031  54.622  54.375  1.00
> 0.00   H
> ATOM 16  H0F UNK 1  36.586  55.471  54.333  1.00
> 0.00   H
> ATOM 17  H0G UNK 1  34.203  56.712  54.134  1.00
> 0.00   H
> ATOM 18  H0H UNK 1  34.270  56.304  52.411  1.00
> 0.00   H
> ATOM 19  H0I UNK 1  35.198  58.620  53.428  1.00
> 0.00   H
> ATOM 20  H0J UNK 1  36.665  57.714  53.788  1.00
> 0.00   H
> ATOM 21  H0K UNK 1  36.982  58.814  51.667  1.00
> 0.00   H
> ATOM 22  H0M UNK 1  35.352  58.378  51.082  1.00
> 0.00   H
> ATOM 23  H0N UNK 1  35.991  56.144  50.537  1.00
> 0.00   H
> ATOM 24  H0O UNK 1  37.280  57.194  49.960  1.00
> 0.00   H
> CONECT1289   10
> CONECT23   11   12
> CONECT34   13   14
> CONECT45   15   16
> CONECT56   17   18
> CONECT67   19   20
> CONECT78   21   22
> CONECT8   23   24
> END
>
> p.
>
>
> On 08/21/18 20:46, Phuong Chau wrote:
>
> Hi Paolo,
>
> I tried the code, it worked with the alignment but the new chemical after
> alignment still does not have any H atoms. If I add H after alignment, it
> has weird structure with H atoms. Would you please show me how to fix this
> problem?
>
> Best,
> Phuong Chau
>
> On Tue, Aug 21, 2018 at 8:01 AM, Paolo Tosco 
> wrote:
>
>> Hi Phuong,
>>
>> If you wish to retain Hs you just need to set removeHs = False when you
>> call MolFromPDBFile():
>>
>> # align.py
>>
>> import sys
>> from rdkit import Chem
>> from rdkit.Chem import rdDistGeom, rdMolAlign, rdForceFieldHelpers
>> from rdkit import Chem
>>
>> refMolwithH = Chem.MolFromPDBFile(sys.argv[1], removeHs = False)
>> s = sys.argv[2]
>> prbMolwithH = Chem.MolFromPDBFile(s, removeHs = False)
>> idx=s.find('_')
>> chemB= s[:idx]
>>
>> rdDistGeom.EmbedMolecule(prbMolwithH)
>> rdForceFieldHelpers.UFFOptimizeMolecule(prbMolwithH)
>>
>> ##Alignment
>> pyO3A = rdMolAlign.GetO3A(prbMolwithH, refMolwithH)
>> score = pyO3A.Align()
>>
>> ##3D coords of Chem B after alignmnet
>> Chem.MolToPDBFile(prbMolwithH,'{}.pdb'.format(chemB))
>>
>> and then
>>
>> ./align.py drug.pdb UNK_A1C198.pdb
>> Cheers,
>> p.
>>
>> On 08/20/18 19:02, Phuong Chau wrote:
>>
>> Hello everyone,
>>
>> I am trying to align two chemicals using their pdb files with the
>> following script:
>>
>> *refMolwithH = Chem.MolFromPDBFile(sys.argv[1])*
>> *s = sys.argv[2]*
>> *prbMolwithH = Chem.MolFromPDBFile(s)*
>> *idx=s.find('_')*
>> *chemB= s[:idx]*
>>
>>