Dear Scikit-learn community,
I have been reading the example at https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity about the feature importances that can be assessed after fitting a tree-based model (e.g. RandomForestClassifier), and I have noticed a discrepancy that I would like to mention. If a one-hot-encoding step is used before model fitting, the `.feature_importances_` attribute includes importances for all the levels of the transformed categorical features (e.g. for gender, we get two importances, one for Male and one for Female). When I apply the `permutation_importance` function (https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance), though, the outputs correspond to the non-transformed columns. To illustrate this, I include a toy example in .py format below.

Best,
Makis
#!/usr/bin/env python
# coding: utf-8

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy data: one categorical feature, one numeric feature, three classes
df = pd.DataFrame({'gender': ['Male', 'Female', 'Male', 'Female', 'Female', 'Female'],
                   'group': [1, 1, 2, 2, 3, 3],
                   'score': [10, 30, 20, 40, 50, 90]})

# Make pipeline: scale the numeric column, one-hot-encode the categorical one
numeric_features = ['score']
categorical_features = ['gender']
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Make model
model = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                               random_state=42, n_jobs=-1)
pipe = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])

# Fit model
X = df.loc[:, df.columns != "group"]
y = df.group
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
pipe.fit(X_train, y_train)

# Plot MDI feature importances: one bar per *transformed* column,
# i.e. the one-hot-encoded levels appear separately
plt.figure(figsize=(12, 10))
importances = pipe[-1].feature_importances_
feature_names = pipe[:-1].get_feature_names_out().tolist()
_ = plt.bar(range(len(feature_names)), importances)
_ = plt.xticks(range(len(feature_names)), feature_names, rotation=90)
plt.title("Feature importances using MDI")
plt.ylabel("Mean decrease in impurity")

# Try MDA: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity
from sklearn.inspection import permutation_importance
result = permutation_importance(pipe, X_test, y_test,
                                n_repeats=1, random_state=42, n_jobs=2)

# Here importances_mean has only 2 elements, corresponding to the original
# 'score' and 'gender' columns, whereas in the MDI case
# pipe[-1].feature_importances_ covers the one-hot-encoded columns as well
result.importances_mean.shape
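For what it is worth, the two outputs differ because `permutation_importance` permutes the columns of whatever X it is given: passing the raw DataFrame to the whole pipeline means gender is shuffled as a single column before the encoding happens. A minimal sketch of one way to make the outputs comparable (reusing `pipe`, `X_test` and `y_test` from the script above): transform the test set first, then run `permutation_importance` on the classifier step alone, so that the permutation happens on the one-hot-encoded columns.

from scipy.sparse import issparse
from sklearn.inspection import permutation_importance

# Apply only the preprocessing steps of the fitted pipeline
X_test_enc = pipe[:-1].transform(X_test)
if issparse(X_test_enc):          # OneHotEncoder may return a sparse matrix
    X_test_enc = X_test_enc.toarray()

# Permute the *encoded* columns and score the classifier step alone
result_enc = permutation_importance(pipe[-1], X_test_enc, y_test,
                                    n_repeats=10, random_state=42)

# Now there is one importance per transformed column, matching the MDI output
for name, imp in zip(pipe[:-1].get_feature_names_out(),
                     result_enc.importances_mean):
    print(name, imp)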
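Conversely, here is a rough sketch (my own helper, not a scikit-learn API) going the other way: aggregating the MDI importances of the encoded columns back onto the original columns, assuming the fitted `pipe` and the transformer names used above. Since the forest's MDI importances sum to 1, summing the per-level importances gives one share per raw feature, on the same granularity as `permutation_importance`.

import pandas as pd

mdi = pd.Series(pipe[-1].feature_importances_,
                index=pipe[:-1].get_feature_names_out())

# ColumnTransformer prefixes names with the transformer name, e.g.
# 'num__score' and 'cat__gender_Male'; strip the prefix and match against
# the original columns (a heuristic that works for this toy data)
aggregated = {
    col: mdi[[name.split("__", 1)[1].startswith(col) for name in mdi.index]].sum()
    for col in X.columns          # original columns: 'gender', 'score'
}
print(aggregated)                 # one MDI value per raw column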
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn