Hi folks,
I turned in the first draft of my PhD thesis yesterday, so I finally
have the time to address my long to-do list for scikit-learn! I'll start
by finishing up my tutorial on machine learning for astronomy
[1]. First of all, many thanks to Jaques for doing a close
read-through of the current iteration and offering a lot of good
detailed feedback. Because of his diligence, I think the tutorial is
close to being ready for merge, though I still plan to modify the
content a bit, and there are still several structural issues I'd like
feedback on:
1) The tutorial examples make use of several astronomy-specific
datasets. These primarily come from publicly available data at the
Sloan Digital Sky Survey [2], but I've done some preprocessing and
uploaded the datasets to my own website. This location for the data
files should be long-lived: the website is associated with a
Python-based statistics textbook I'm coauthoring, which will be
published in early 2013. Currently, I've put the loaders for these
datasets in the example plotting scripts themselves. Should the loaders
be moved to sklearn.datasets, so the data can be used for general
examples which are not associated with the tutorial? Or do you think
it's OK to have tutorial-specific loaders left out of the general
sklearn package?
2) In the tutorial files, I'd like to show in-line example code for the
loading, processing, and plotting of these datasets. I think this may
pose a problem for doctests, because it could result in large downloads
and/or matplotlib plotting when nosetests are run. My gut feeling is
that it would be better to have nosetests ignore the code snippets in
the tutorial. Any input on this? What's the best way to tell doctests
to ignore these code blocks (short of an ignore directive on each line)?
3) Currently the exercises follow the format that Olivier set up in his
tutorials, with a "skeleton" and a "solution" for each example script.
On Fernando's suggestion, I'd like to move to using IPython notebooks
for these examples. I think it leads to a much smoother interface,
especially the ability to try out code snippets one by one, avoiding
errors associated with running incomplete code. This may lead to a
problem: the IPython notebook is still relatively new, and not everyone
can read *.ipynb files. Any input on whether I should remove the current
skeleton/solution scripts in favor of IPython notebooks, or try to
retain both versions of each example for broader compatibility?
4) Any other issues I'm not thinking of?
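To make question 1 more concrete, here is a rough sketch of the kind of
download-and-cache loader being discussed. It is not the code from the pull
request: the function name `fetch_dataset`, the URL, and the cache directory
are hypothetical placeholders chosen for illustration only.

```python
import os
from urllib.request import urlopen

# Hypothetical placeholder URL; the real data location is not given here.
DATA_URL = "http://example.com/sdss_sample.csv"


def fetch_dataset(url=DATA_URL, data_home=None, filename=None):
    """Download `url` into a local cache directory once; return the file path."""
    if data_home is None:
        # Assumed default cache location, analogous to a per-user data folder.
        data_home = os.path.join(os.path.expanduser("~"), "astro_tutorial_data")
    os.makedirs(data_home, exist_ok=True)

    if filename is None:
        filename = os.path.basename(url)
    path = os.path.join(data_home, filename)

    # Only hit the network when the file is not already cached.
    if not os.path.exists(path):
        with urlopen(url) as response, open(path, "wb") as out:
            out.write(response.read())
    return path
```

With a loader shaped like this, a second call returns the cached path without
downloading again, which matters whether the function lives in the example
scripts or in a shared module.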
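On question 2, one possible approach (a sketch, not a recommendation for how
the scikit-learn docs must be set up) relies on the fact that the standard
`doctest` parser only collects lines that start with a `>>>` prompt. Snippets
written as plain literal blocks are therefore never executed, while prompted
lines need a `+SKIP` directive each. The names `fetch_data` and `plot` below
are hypothetical, and how the nose doctest plugin is configured is not covered.

```python
import doctest

# A literal block without prompts: doctest finds nothing to run here.
PLAIN_BLOCK = """
Loading the data::

    data = fetch_data()      # hypothetical loader, never executed
    plot(data)
"""

# The same idea with prompts: each expensive line needs its own directive.
PROMPT_BLOCK = """
Loading the data::

    >>> data = fetch_data()  # doctest: +SKIP
    >>> 1 + 1
    2
"""

parser = doctest.DocTestParser()
plain_examples = parser.get_examples(PLAIN_BLOCK)
prompt_examples = parser.get_examples(PROMPT_BLOCK)


def run_snippet(text):
    """Run the doctest examples found in `text` and return the results."""
    test = parser.get_doctest(text, {}, "snippet", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_snippet(PROMPT_BLOCK)
```

Here `plain_examples` is empty, and the skipped `fetch_data()` call in the
prompted version never runs, so the undefined name does not cause a failure.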
Thanks for taking the time to read through all this. I should mention
that I'm working on this now in preparation for the scikit-learn
tutorial at SciPy 2012 in Austin two weeks from now. I hope to see some
of you there!
Jake
[1] https://github.com/scikit-learn/scikit-learn/pull/837
[2] http://www.sdss.org/