load_iris() reads a csv file and then retrieves/sets some other info, like the feature names and a description of the dataset (which comes from another file).

Then it packs everything into a Bunch object, which is basically a fancy dict: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/__init__.py#L63
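
For example (a minimal sketch; Bunch is importable from sklearn.utils):

from sklearn.utils import Bunch

# A Bunch behaves like a dict, but the keys are also accessible as attributes.
b = Bunch(data=[[5.1, 3.5], [4.9, 3.0]], target=[0, 0])
print(b["data"])    # dict-style access
print(b.target)     # attribute-style access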

You can take inspiration from the source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_base.py#L396) if you want to replicate what load_xxx() does, but you do not need a Bunch at all to follow the PCA article that you mentioned. As previously noted, you just need to understand what each piece is doing at a high level and slightly modify the input to the functions according to your needs.
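
If you do want to mimic it, here is a minimal sketch of roughly what load_iris() does with the csv file. It assumes a hypothetical file my_data.csv in the same format as sklearn's iris.csv (a header line "n_samples,n_features,class_0,class_1,..." followed by data rows whose last column is the class index); it is not the exact library code.

import csv
import numpy as np
from sklearn.utils import Bunch

def load_my_csv(path):
    with open(path) as f:
        reader = csv.reader(f)
        header = next(reader)                 # e.g. "150,4,setosa,versicolor,virginica"
        n_samples = int(header[0])
        n_features = int(header[1])
        target_names = np.array(header[2:])
        data = np.empty((n_samples, n_features))
        target = np.empty(n_samples, dtype=int)
        for i, row in enumerate(reader):
            data[i] = np.asarray(row[:-1], dtype=float)  # feature columns
            target[i] = int(row[-1])                     # last column = class index
    return Bunch(data=data, target=target, target_names=target_names)

# "my_data.csv" is a placeholder name for your own file in the same format.
dataset = load_my_csv("my_data.csv")
print(dataset.data.shape, dataset.target_names)

load_iris() additionally fills in feature_names and reads DESCR from a separate rst file shipped with the package, but none of that is needed for the PCA article.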



On 11/8/20 2:19 PM, Mahmood Naderan wrote:
>You need to understand what the different statements are doing; just
>as you need to understand what processing you apply on your data
>(whether it's preprocessing or learning) to properly use any machine
>learning tool.

I know, but the problem is that the iris csv file doesn't contain such information, and as I said, I think there are some additional steps that I don't know exactly.

For example, if you look at ~/.local/lib/python3.6/site-packages/sklearn/datasets/data/iris.csv you will see

150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
...

So, the first line means 150 instances (rows), 4 feature columns, and three iris classes (setosa, versicolor, virginica).
However, when I use

iris = load_iris()
print(iris)

I see a lot of metadata, such as:

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
...
       [5.9, 3. , 5.1, 1.8]]), 'target': array([0, 0,...]), 'frame': None, 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n      - Iris-Virginica\n                \n


The question is: how are these metadata created and stored in this package?
I mean, what does

from sklearn.datasets import load_iris

do with the csv file? If I knew that, then I would also be able to create a similar dataset.


Regards,
Mahmood

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
