On Mon, 2021-09-20 at 19:52 +0200, Christian Kastner wrote:
> > Or should we not build these jupyter notebooks for the -doc package?
>
> I don't think anyone would stop you from packaging the datasets but to
> be honest, I think that would be overkill. The -doc package has a popcon
> of 93, and I would assume that (like me) most users of scikit-learn use
> upstream's online documentation directly.

Many machine learning-related packages require external datasets, and
upstream usually provides APIs that let users download them
automatically when the datasets are useful to a large audience. I vote
for "packaging a dataset is not necessary", and we can use a pytest
marker to skip the tests that require external data.

I have refrained from uploading any datasets except for

$ apt list dataset\*
Listing... Done
dataset-fashion-mnist/unstable,unstable,now

as it can serve as a universal sanity-test dataset for any machine
learning tool. (In academia, people use the dataset named MNIST;
Fashion-MNIST above is an MIT-licensed alternative.)
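
To make the pytest marker idea concrete, here is a minimal sketch of a
conftest.py. It is not taken from scikit-learn or any existing package:
the marker name "external_data", the environment variable
ALLOW_EXTERNAL_DATA, and the helper names are all hypothetical. It
deselects marked tests unless downloads are explicitly allowed.

```python
# conftest.py (sketch). The marker name and the environment variable
# below are hypothetical choices, not an upstream convention.
import os

MARKER = "external_data"


def external_data_allowed(environ=None):
    """Return True only if the hypothetical opt-in variable is set to "1"."""
    environ = os.environ if environ is None else environ
    return environ.get("ALLOW_EXTERNAL_DATA") == "1"


def split_items(items):
    """Split collected tests into (kept, dropped) based on the marker."""
    kept, dropped = [], []
    for item in items:
        if item.get_closest_marker(MARKER) is None:
            kept.append(item)
        else:
            dropped.append(item)
    return kept, dropped


def pytest_configure(config):
    # Register the marker so that `pytest --strict-markers` accepts it.
    config.addinivalue_line(
        "markers", f"{MARKER}: test needs to download an external dataset"
    )


def pytest_collection_modifyitems(config, items):
    # Leave the collection untouched when downloads are explicitly allowed.
    if external_data_allowed():
        return
    kept, dropped = split_items(items)
    if dropped:
        # Report the dropped tests as deselected and keep only the rest.
        config.hook.pytest_deselected(items=dropped)
        items[:] = kept
```

A test that fetches data would then be decorated with
@pytest.mark.external_data. Alternatively, without any hook, the build
could simply run pytest -m "not external_data" to get the same effect.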

