Hi folks,
I turned in the first draft of my PhD thesis yesterday, so I finally 
have the time to address my long to-do list for scikit-learn!  I'll be 
starting with finishing up my tutorial on machine learning for Astronomy 
[1].  First of all, many thanks to Jaques for doing a close 
read-through of the current iteration and offering a lot of good, 
detailed feedback.  Because of his diligence, I think the tutorial is 
close to being ready to merge, though I still plan to modify the 
content a bit, and there are several structural issues I'd like 
feedback on:

1) The tutorial examples make use of several astronomy-specific 
datasets.  These primarily come from publicly available data at the 
Sloan Digital Sky Survey [2], but I've done some preprocessing and 
uploaded the datasets to my own website.  This location for the data 
files should be long-lived: the website is associated with a 
Python-based statistics textbook I'm coauthoring, which will be 
published in early 2013.  Currently, I've put the loaders for these 
datasets in the example plotting scripts themselves.  Should the loaders 
be moved to sklearn.datasets, so the data can be used for general 
examples which are not associated with the tutorial? Or do you think 
it's OK to have tutorial-specific loaders left out of the general 
sklearn package?
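
To make the question concrete, here is a minimal sketch of the kind of 
caching loader I have in mind.  The function name, the default cache 
directory, and the `download` parameter are all hypothetical and only for 
illustration; a version living in sklearn.datasets would presumably use 
the existing get_data_home() helper for the cache location instead.

```python
import os
from urllib.request import urlretrieve


def fetch_dataset_file(url, filename, data_home=None, download=urlretrieve):
    """Return the local path to `filename`, downloading it only once.

    Hypothetical helper: the name, the default cache directory, and the
    `download` hook are assumptions made for this sketch.
    """
    if data_home is None:
        # Assumed default cache location for the tutorial data.
        data_home = os.path.join(os.path.expanduser("~"),
                                 "astro_tutorial_data")
    os.makedirs(data_home, exist_ok=True)

    path = os.path.join(data_home, filename)
    if not os.path.exists(path):
        # Only hit the network when the file is not already cached.
        download(url, path)
    return path
```

With something like this, each example script would call the helper 
instead of carrying its own download logic, and repeated runs would reuse 
the cached file.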

2) In the tutorial files, I'd like to show in-line example code for the 
loading, processing, and plotting of these datasets.  I think this may 
pose a problem for doctests, because it could result in large downloads 
and/or matplotlib plotting when nosetests is run. My gut feeling is 
that it would be better to have nosetests ignore the code snippets in 
the tutorial.  Any input on this?  What's the best way to tell doctests 
to ignore these code blocks (short of an ignore directive on each line)?
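
For reference, here are the two options I know of, sketched with the 
standard doctest module.  The tutorial text and the fetch_big_dataset 
name are made up for this example.  An example carrying the +SKIP 
directive is parsed but not executed, while code written without ">>>" 
prompts is not collected as a test at all, which might be the simplest 
route for the heavy download and plotting blocks.

```python
import doctest

# Made-up tutorial text: one skipped example, one prompt-free code
# block, and one cheap example that doctest still runs.
TUTORIAL_TEXT = '''
Loading the data (not executed, because of the directive):

    >>> data = fetch_big_dataset()  # doctest: +SKIP

Plotting code written without prompts is never collected::

    import pylab
    pylab.plot(data)

A cheap example that still runs:

    >>> 1 + 1
    2
'''


def run_tutorial_doctests(text):
    """Run the doctest examples found in `text` and return the results."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, globs={}, name="tutorial",
                              filename=None, lineno=0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_tutorial_doctests(TUTORIAL_TEXT)
```

Here results.failed is 0 even though fetch_big_dataset is never defined, 
since the skipped example is never executed.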

3) Currently the exercises follow the format that Olivier set up in his 
tutorials, with a "skeleton" and a "solution" for each example script.  
On Fernando's suggestion, I'd like to move to using IPython notebooks 
for these examples.  I think it leads to a much smoother interface, 
especially the ability to try out code snippets one by one, avoiding 
errors associated with running incomplete code.  This may lead to a 
problem: the IPython notebook is still relatively new, and not everyone 
can read *.ipynb files.  Any input on whether I should remove the current 
skeleton/solution scripts in favor of IPython notebooks, or try to 
retain both versions of each example for broader compatibility?

4) Any other issues I'm not thinking of?

Thanks for taking the time to read through all this.  I should mention 
that I'm working on this now in preparation for the scikit-learn 
tutorial at SciPy 2012 in Austin two weeks from now.  I hope to see some 
of you there!
    Jake

[1] https://github.com/scikit-learn/scikit-learn/pull/837
[2] http://www.sdss.org/

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general