Hi Shojiro,
This might not be the most elegant or efficient way, but it worked for what
I wanted to do. I changed my apply function so that each descriptor gets its
own named column, as below:
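import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from sklearn.ensemble import RandomForestRegressor

allDescp = [name[0] for name in Descriptors._descList]
for name in allDescp:
    temp = MoleculeDescriptors.MolecularDescriptorCalculator([name])
    df[name] = df['SMILES'].apply(
        lambda x: temp.CalcDescriptors(Chem.MolFromSmiles(x))[0])

y = df['Target'].values
# keep X as a DataFrame (descriptor columns only) so the column names survive
X = df.drop(columns=['SMILES', 'Target'])
features_list = X.columns.values
model = RandomForestRegressor(n_estimators=1000, random_state=42)
model.fit(X, y)
feature_importance = model.feature_importances_
# importances sum to 1, so the threshold has to be a fraction below 1
threshold = 0.05
important_index = np.where(feature_importance > threshold)[0]
important_features = features_list[important_index]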
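
For anyone who wants a ranking instead of a cutoff, here is a minimal sketch of pairing the names with the importances via argsort, as Shojiro suggested. The helper name `rank_features` and the importance numbers are made up for illustration; in practice the values would come from model.feature_importances_ and the names from allDescp (or X.columns).

```python
import numpy as np


def rank_features(names, importances, top_n=None):
    """Return (name, importance) pairs, most important first."""
    names = np.asarray(names)
    importances = np.asarray(importances, dtype=float)
    if names.shape != importances.shape:
        raise ValueError("names and importances must have the same length")
    # argsort sorts ascending; reversing puts the largest importance first
    order = np.argsort(importances)[::-1]
    if top_n is not None:
        order = order[:top_n]
    return [(str(names[i]), float(importances[i])) for i in order]


# Hypothetical importances, not real results
example_names = ["MolWt", "MolLogP", "TPSA", "NumHDonors"]
example_importances = [0.10, 0.55, 0.30, 0.05]

ranked = rank_features(example_names, example_importances)
for name, score in ranked:
    print(f"{name:12s} {score:.2f}")
```

This only works if the names are in the same order as the columns the model was fitted on.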
Ali
On Tue, Aug 21, 2018 at 6:51 AM Ali Eftekhari <a.b.eftekh...@gmail.com>
wrote:
> Hi Shojiro,
>
> Thanks for your response, but print
> (np.argsort(rfregress.feature_importances_)[::-1]) returns only indices,
> and what I want is the column names, so that I can tell which features
> are important.
>
> On Mon, Aug 20, 2018 at 9:31 PM Shojiro Shibayama <notify.p...@gmail.com>
> wrote:
>
>> Dear Ali,
>>
>> Please first run the following code, which may help you:
>>
>> ```python
>> import numpy as np
>> np.argsort(rfregress.feature_importances_)[::-1]
>> ```
>>
>> `argsort` returns the indexes that sort the importances in ascending
>> order, and [::-1] reverses them so the most important feature comes first.
>> These indexes correspond to the order of the variables (the order in
>> 'allDescp' in your code), so if you use them to index those names, you'll
>> get the information that you want.
>>
>> Sincerely yours,
>> Shojiro
>>
>>
>> On Tue, 21 Aug 2018 at 10:34, Ali Eftekhari <a.b.eftekh...@gmail.com>
>> wrote:
>>
>>> Hello rdkit,
>>>
>>> This might be trivial, but I am a beginner and don't know how to do it.
>>>
>>> I am building a simple model to predict a target property. I have a pandas
>>> DataFrame (df) whose columns are 'SMILES' and 'Target'.
>>>
>>> # calculating the descriptors as below:
>>> allDescp = [name[0] for name in Descriptors._descList]
>>> calc = MoleculeDescriptors.MolecularDescriptorCalculator(allDescp)
>>> df['fp'] = df['SMILES'].apply(
>>>     lambda x: calc.CalcDescriptors(Chem.MolFromSmiles(x)))
>>>
>>> #converting the fingerprint to numpy array
>>> y=df['Target'].values
>>> X=np.array(list(df['fp']))
>>>
>>> #preprocessing
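>>> X_train, X_test, y_train, y_test = train_test_split(
>>>     X, y, test_size=0.25, random_state=42)
>>> st = StandardScaler()
>>> # fit the scaler on the training split only, then apply it to both splits
>>> X_train = st.fit_transform(X_train)
>>> X_test = st.transform(X_test)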
>>>
>>> #random forest model
>>> rfregress = RandomForestRegressor(n_estimators=10)
>>> rfregress.fit(X_train, y_train)
>>>
>>> My problem is that I don't know how to get meaningful feature
>>> importances. The following returns the importance values, but there are
>>> no labels, so I can't tell which features are important.
>>>
>>> print(sorted(rfregress.feature_importances_))
>>>
>>> Thanks for your help!
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>>
>> --
>> ----
>> The University of Tokyo
>> 2nd year Ph.D. candidate
>> Shojiro Shibayama
>> ----
>>
>