I was trying to load the training dataset from the Springleaf Marketing Response competition 
<https://www.kaggle.com/c/springleaf-marketing-response> on Kaggle. The CSV 
is 921 MB, with 145231 rows and 1934 columns. My machine has 8 GB of RAM, and 
Julia had consumed 5.8 GB+ after 5-6 minutes without completing the read, at 
which point I stopped it because there was barely any memory left for the OS 
to function properly. I'm on Windows 8 64-bit. I used the following code to 
read the CSV into Julia:

using DataFrames
train = readtable("C:\\train.csv")

Next I tried to load the same file in Python:

import pandas as pd
train = pd.read_csv("C:\\train.csv")

This took ~2.4 GB of memory and about a minute.
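As an aside, one pandas technique that might bound peak memory on a file this size is reading in chunks and concatenating (a sketch only; I have not measured it on this dataset, and the in-memory CSV below is a hypothetical stand-in for train.csv):

```python
import io
import pandas as pd

# Hypothetical tiny CSV standing in for the 921 MB train.csv
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")

# chunksize makes read_csv return an iterator of DataFrames,
# so the parser never holds the whole raw file at once
chunks = pd.read_csv(csv_data, chunksize=2)
train = pd.concat(chunks, ignore_index=True)
print(train.shape)  # (3, 2)
```

Passing explicit dtypes to read_csv (e.g. float32 instead of the default float64) can cut the resident size further, at some precision cost.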

Checking the same in R:
df = read.csv('E:/Libraries/train.csv', as.is = T)

This took 2-3 minutes and consumed 3.5 GB of memory on the same machine.

Why such a discrepancy, and why does Julia run out of memory before it even 
finishes loading the CSV? Is there a better way to get the file loaded in 
Julia?
