On Wed, Dec 4, 2013 at 9:58 AM, Olivier Grisel <[email protected]> wrote:
> As a user I must confess that I like the flat numpy API, both in
> interactive sessions and in regular code. The main con is that it's
> often hard to find the source code of a particular class or function,
> especially when it's a builtin object from a CPython extension.
> Fortunately in our case, most of the public API is made of regular
> Python classes / functions with a __file__ attribute and in turn wrap
> private compiled extensions.
>
> However flattening the scikit-learn API would feel weird sometimes.
> For instance I find skl.SGDClassifier (without the
> sklearn.linear_model) misleading. The sklearn.linear_model namespace
> is informative in that case. A feed forward 2 layers neural net is
> technically also a SGD base Classifier. But maybe the SGDClassifier
> name is just bad and should be SGDLinearClassifier in general
> independently of the flat namespace.
>
> I also feel like Gael that providing two official public APIs, one for
> interactive scripting, the other for (clean | pure) application code
> to be confusing, especially for newcomers.
>
> Let add another option to Joel proposals:
>
> Option #5 to Joel's proposals: have a __all__ list in sklearn.__init__
> that imports the first level public package names (e.g. everything but
> utils basically) to make it possible to do:
>
>     import sklearn as skl
>
>     skl.grid_search.GridSearchCV(skl.pipeline.Pipeline([
>         ('sel', skl.feature_selection.SelectKBest(skl.feature_selection.chi2)),
>         ('clf', skl.svm.LinearSVC())
>     ], {'clf__C': [.1, 1.]}))
>
>
> Possible cons:
>
> - import time might slow down a bit: to be benched to measure whether
>   this is negligible or not
> - we should be careful in the sklearn.__init__ to import stuff in the
>   right order to avoid introducing circular dependencies but our test
>   suite should check that right away for us

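On the "to be benched" point in the quoted list of cons: a rough way to measure cold import time is to start a fresh interpreter for each run, so that already-cached modules do not hide the cost. This is only a sketch; the helper name time_cold_import is made up for illustration, and "json" is a stand-in for whichever package is being studied.

```python
import subprocess
import sys
import time


def time_cold_import(module_name, repeats=3):
    """Best wall-clock time (seconds) to import a module in a fresh interpreter.

    Hypothetical helper, not part of sklearn or statsmodels.
    The figure includes interpreter startup.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # A new process guarantees nothing is cached in sys.modules yet.
        subprocess.run([sys.executable, "-c", f"import {module_name}"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


# "sys" is always preloaded, so this approximates bare interpreter startup.
baseline = time_cold_import("sys")
# Stand-in module; replace with the package whose __init__ is being changed.
elapsed = time_cold_import("json")
```

Comparing elapsed against baseline, before and after adding imports to an __init__.py, gives a first idea of whether the slowdown is negligible. For a per-module breakdown, Python 3.7 and later also accept "python -X importtime -c 'import pkg'".
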
From what I can see in the sklearn source, your __all__ in the
__init__.py works very similarly to our api.py. You are also collecting
the imports, but in the __init__ instead of the api.

We additionally added the main model classes to the flat namespace, but
statsmodels has fewer model classes than sklearn.

The main reason that we added the parallel api system, and moved all
imports from the __init__.py into the api.py, is to reduce import times.
If we do heavy calculations, then the import times don't matter.
However, when I want to use or work on something in statsmodels.stats,
then I can write fast-starting scripts, as long as I avoid importing
large parts of statsmodels, pandas (and the implied matplotlib, scipy,
...).

The other main reason to avoid imports is that using joblib is only
worth it if we can put a lot of work into each process, because each
"fork" on Windows starts a fresh process that has to redo the imports.
The main reason we don't use joblib much inside statsmodels is that most
of our current algorithms don't take enough time to make parallel
processing worthwhile. We have cross-validation and bootstrap in only a
few places so far, and some of the slow code cannot be parallelized.

Josef

>
> --
> Olivier
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
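
The api.py arrangement described in the reply can be sketched with a throwaway package. Everything here is hypothetical: "toypkg" and its modules are invented for illustration and do not mirror the real statsmodels layout. The point is only that an empty __init__.py lets a script import one submodule without pulling in the rest, while api.py collects the flat namespace for those who want it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a hypothetical package on disk (names are illustrative only).
root = Path(tempfile.mkdtemp())
pkg = root / "toypkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")  # deliberately empty: no eager imports
(pkg / "stats.py").write_text("def mean(xs):\n    return sum(xs) / len(xs)\n")
(pkg / "heavy.py").write_text("class Model:\n    pass\n")  # stands in for slow imports
(pkg / "api.py").write_text(
    "from .stats import mean\n"
    "from .heavy import Model\n"
    "__all__ = ['mean', 'Model']\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Importing a single submodule leaves the heavy module untouched.
stats = importlib.import_module("toypkg.stats")
heavy_loaded_early = "toypkg.heavy" in sys.modules

# Importing the api module pulls everything into one flat namespace.
api = importlib.import_module("toypkg.api")
heavy_loaded_late = "toypkg.heavy" in sys.modules
result = api.mean([1, 2, 3])
```

With the imports in __init__.py instead (as in Option #5), the first import_module call would already load heavy.py; that is the trade-off between convenience and start-up time discussed in this thread.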
