Re: [scikit-learn] Need help in dealing with large dataset

Sebastian Raschka Mon, 05 Mar 2018 09:31:00 -0800

As Guillaume suggested, you don't want to load the whole array into memory if 
it's that large. There are many ways to deal with this. The most naive way 
would be to break up your NumPy array into smaller NumPy arrays and load them 
iteratively while keeping a running accuracy calculation. My suggestion would 
be to create an HDF5 file from the NumPy array where each entry is an image. 
If it's just the test images, you can also save a batch of them as one entry 
because you don't need to shuffle them anyway.
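
A minimal sketch of the chunked approach with a running accuracy, in case it 
helps. It assumes the images are stored as uint8 (the thread does not state 
the dtype); under that assumption a full float32 copy of an array of shape 
(85108, 256, 256, 3) needs roughly 67 GB, which is why converting everything 
at once fails on a 32 GB machine. The names running_accuracy, dummy_predict 
and X_demo.npy are made up for illustration, and the tiny random array only 
stands in for the real file.

```python
import os
import tempfile

import numpy as np


def running_accuracy(X, y, predict_fn, batch_size=256):
    """Accuracy over X computed batch by batch; X may be a np.memmap."""
    n_correct = 0
    for start in range(0, len(X), batch_size):
        stop = min(start + batch_size, len(X))
        # Only this slice is copied into RAM and converted to float32.
        batch = np.asarray(X[start:stop], dtype=np.float32) / 255.0
        n_correct += int(np.sum(predict_fn(batch) == y[start:stop]))
    return n_correct / len(X)


# Size of a full float32 copy for the shape reported in the thread
# (4 bytes per value), in GB.
full_copy_gb = 85108 * 256 * 256 * 3 * 4 / 1e9

# Stand-in data: a small uint8 array saved to disk and memory-mapped,
# playing the role of the large .npy file (assumption for this demo).
tmp_path = os.path.join(tempfile.mkdtemp(), "X_demo.npy")
rng = np.random.default_rng(0)
np.save(tmp_path, rng.integers(0, 256, size=(1000, 8, 8, 3), dtype=np.uint8))
X = np.load(tmp_path, mmap_mode="r")
y = np.zeros(len(X), dtype=np.int64)


def dummy_predict(batch):
    # Hypothetical model that always predicts class 0.
    return np.zeros(len(batch), dtype=np.int64)


acc = running_accuracy(X, y, dummy_predict, batch_size=128)
print(round(full_copy_gb, 1), acc)
```

In practice predict_fn would be the trained model's prediction call; the 
batch size is a free parameter to tune against available memory.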


Ultimately, the best trade-off between performance and convenience depends on 
which DL framework you use. Since this is a scikit-learn forum, I suppose you 
are using sklearn objects (although I am not aware that sklearn has CNNs). The 
DataLoader in PyTorch is broadly useful, though, and can come in handy no 
matter what CNN implementation you use. I have some examples here if that 
helps:

- https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-celeba.ipynb
- https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-csv.ipynb
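
For a rough idea of how a DataLoader could sit on top of a memory-mapped .npy 
file, here is a small sketch. It is not taken from the notebooks above. The 
class name NpyImageDataset, the file names, and the separate label file are 
assumptions for illustration, and the images are assumed to be uint8 in 
(N, H, W, C) layout.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class NpyImageDataset(Dataset):
    """Serves single images from .npy files without loading them fully."""

    def __init__(self, x_path, y_path):
        # mmap_mode="r" keeps the image data on disk; only indexed rows are read.
        self.X = np.load(x_path, mmap_mode="r")
        # Labels are small, so they are loaded into memory normally.
        self.y = np.load(y_path)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        # Copy one image into RAM, convert to float32, and scale to [0, 1].
        img = np.asarray(self.X[idx], dtype=np.float32) / 255.0
        # PyTorch convolutions expect channels first: (H, W, C) -> (C, H, W).
        img = torch.from_numpy(img).permute(2, 0, 1)
        return img, int(self.y[idx])


# Stand-in files for the demo (hypothetical names and sizes).
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "X_demo.npy")
y_path = os.path.join(tmp_dir, "y_demo.npy")
rng = np.random.default_rng(0)
np.save(x_path, rng.integers(0, 256, size=(10, 8, 8, 3), dtype=np.uint8))
np.save(y_path, rng.integers(0, 2, size=10))

loader = DataLoader(NpyImageDataset(x_path, y_path), batch_size=4, shuffle=False)
batch_shapes = [tuple(images.shape) for images, labels in loader]
print(batch_shapes)
```

Only one batch of float32 images is held in memory at a time, so the peak 
memory use depends on batch_size rather than on the size of the file.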

Best,
Sebastian


> On Mar 5, 2018, at 12:13 PM, Guillaume Lemaître <[email protected]> 
> wrote:
> 
> If you work with deep nets, you need to check the utilities from the deep 
> net library.
> For instance, in Keras you should create a batch generator if you need to 
> deal with a large dataset.
> In PyTorch you can use the DataLoader and the ImageFolder from 
> torchvision, which manage the loading for you.
> 
> On 5 March 2018 at 17:19, CHETHAN MURALI <[email protected]> wrote:
> Dear All,
> 
> I am working on building a CNN model for an image classification problem.
> As part of it, I have converted all my test images to a NumPy array.
> 
> Now when I try to split the array into training and test sets, I get a 
> memory error.
> Details are below:
> 
> X = np.load("./data/X_train.npy", mmap_mode='r')
> train_pct_index = int(0.8 * len(X))
> X_train, X_test = X[:train_pct_index], X[train_pct_index:]
> X_train = X_train.reshape(X_train.shape[0], 256, 256, 3)
> X_train = X_train.astype('float32')
> 
> -------------------------------------------------
> MemoryError                               Traceback (most recent call last)
> <ipython-input-46-9180807e01dc> in <module>()
>       2 print("Normalizing Data")
>       3
> ----> 4 X_train = X_train.astype('float32')
> 
> More information:
> 
> 1. My Python version is:
> 
> python --version
> Python 3.6.4 :: Anaconda custom (64-bit)
> 
> 2. I am running the code on Ubuntu 16.04.
> 
> 3. I have 32 GB of RAM.
> 
> 4. The X_train.npy file that I loaded into the np.array is 20 GB in size.
> 
> print("X_train Shape: ", X_train.shape)
> X_train Shape:  (85108, 256, 256, 3)
> 
> I would be really glad if you can help me to overcome this problem.
> 
> Regards,
> Chethan
> 
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> 
> -- 
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
