Re: [julia-users] 900mb csv loading in Julia failed: memory comparison vs python pandas and R

Grey Marsh Tue, 13 Oct 2015 21:03:08 -0700

@Jacob, I tried your approach. It got stuck at the "@time ds = 
DataStreams.DataTable(f)" line. After 15 minutes, Julia is using 
~500 MB and one CPU core with no sign of finishing. Memory use has stayed 
almost the same for the whole 15 minutes. I'm letting it run, hoping that 
it finishes eventually.


From your run, I can see it needs 12 GB of memory, which is more than my 
machine's 8 GB. Could that be the problem?
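
For a rough sense of scale, here is a back-of-envelope sketch of how large the parsed table would be, using the row and column counts from the schema output quoted below. It assumes every cell is stored as a single 8-byte value, which is a simplification (string columns need more, and every reader adds its own overhead), so treat the result as a lower-bound estimate rather than a measurement. Note also that the figure printed by Julia's @time is the total number of bytes allocated over the whole call, not the peak memory in use at any one moment, so it is not directly comparable to installed RAM.

```python
# Back-of-envelope size of the parsed table. ROWS and COLS come from the
# schema output in the quoted message; BYTES_PER_CELL is an assumption.
ROWS = 145231
COLS = 1934
BYTES_PER_CELL = 8  # assumed: one 64-bit value per cell (strings need more)
RAM_GIB = 8         # machine memory mentioned in this thread


def dense_table_bytes(rows, cols, bytes_per_cell=BYTES_PER_CELL):
    """Bytes needed to hold rows x cols fixed-width cells, with no overhead."""
    return rows * cols * bytes_per_cell


def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


dense = dense_table_bytes(ROWS, COLS)
print(f"cells: {ROWS * COLS:,}")
print(f"dense size: {to_gib(dense):.2f} GiB of {RAM_GIB} GiB RAM")
```

Under that assumption the table itself comes to about 2.1 GiB, which is in the same range as the ~2.4 GB reported for pandas further down the thread. Anything well beyond that is more likely temporary allocations made during parsing than the final data.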

On Wednesday, October 14, 2015 at 2:28:09 AM UTC+5:30, Jacob Quinn wrote:
>
> I'm hesitant to suggest this, but if you're in a bind, I have an experimental 
> package for fast CSV reading. The API has stabilized somewhat over the last 
> week and I'm planning a broader release soon, but I'd still consider it 
> alpha. That said, if anyone's willing to give it a test drive, you just 
> need to run:
>
> Pkg.add("Libz")
> Pkg.add("NullableArrays")
> Pkg.clone("https://github.com/quinnj/DataStreams.jl")
> Pkg.clone("https://github.com/quinnj/CSV.jl")
>
> With the original file referenced here I get:
>
> julia> reload("CSV")
>
> julia> f = CSV.Source("/Users/jacobquinn/Downloads/train.csv";null="NA")
> CSV.Source: "/Users/jacobquinn/Downloads/train.csv"
> delim: ','
> quotechar: '"'
> escapechar: '\\'
> null: "NA"
> schema: 
> DataStreams.Schema(UTF8String["ID","VAR_0001","VAR_0002","VAR_0003","VAR_0004","VAR_0005","VAR_0006","VAR_0007","VAR_0008","VAR_0009"
>  
>  … 
>  
> "VAR_1926","VAR_1927","VAR_1928","VAR_1929","VAR_1930","VAR_1931","VAR_1932","VAR_1933","VAR_1934","target"],[Int64,DataStreams.PointerString,Int64,Int64,Int64,DataStreams.PointerString,Int64,Int64,DataStreams.PointerString,DataStreams.PointerString
>  
>  … 
>  
> Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,DataStreams.PointerString,Int64],145231,1934)
> dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
>
>
> julia> @time ds = DataStreams.DataTable(f)
>  43.513800 seconds (694.00 M allocations: 12.775 GB, 2.55% gc time)
>
>
> You can convert the result to a DataFrame with:
>
> function DataFrames.DataFrame(dt::DataStreams.DataTable)
>     cols = dt.schema.cols
>     data = Array(Any,cols)
>     types = DataStreams.types(dt)
>     for i = 1:cols
>         data[i] = DataStreams.column(dt,i,types[i])
>     end
>     return DataFrame(data,Symbol[symbol(x) for x in dt.schema.header]) 
> end
>
>
> -Jacob
>
> On Tue, Oct 13, 2015 at 2:40 PM, feza wrote:
>
>> I was finally able to load it, but the process consumes a ton of memory.
>> julia> @time train = readtable("./test.csv");
>> 124.575362 seconds (376.11 M allocations: 13.438 GB, 10.77% gc time)
>>
>>
>>
>> On Tuesday, October 13, 2015 at 4:34:05 PM UTC-4, feza wrote:
>>>
>>> Same here on a 12gb ram machine 
>>>
>>>                _
>>>    _       _ _(_)_     |  A fresh approach to technical computing
>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>>    _ _   _| |_  __ _   |  Type "?help" for help.
>>>   | | | | | | |/ _` |  |
>>>   | | |_| | | | (_| |  |  Version 0.5.0-dev+429 (2015-09-29 09:47 UTC)
>>>  _/ |\__'_|_|_|\__'_|  |  Commit f71e449 (14 days old master)
>>> |__/                   |  x86_64-w64-mingw32
>>>
>>> julia> using DataFrames
>>>
>>> julia> train = readtable("./test.csv");
>>> ERROR: OutOfMemoryError()
>>>  in resize! at array.jl:452
>>>  in readnrows! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:164
>>>  in readtable! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:767
>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:847
>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:893
>>>
>>>
>>>
>>>
>>>
>>> On Tuesday, October 13, 2015 at 3:47:58 PM UTC-4, Yichao Yu wrote:
>>>>
>>>>
>>>> On Oct 13, 2015 2:47 PM, "Grey Marsh" <[email protected]> wrote:
>>>>
>>>> Which Julia version are you using? There's some GC tweak on 0.4 for 
>>>> that.
>>>>
>>>> >
>>>> > I was trying to load the training dataset from the Springleaf Marketing 
>>>> > Response competition on Kaggle. The CSV is 921 MB and has 145231 rows and 
>>>> > 1934 columns. My machine has 8 GB of RAM, and Julia had consumed more than 
>>>> > 5.8 GB when I stopped it, as there was barely any memory left for the OS 
>>>> > to function properly. By then the incomplete operation had been running 
>>>> > for about 5-6 minutes. I'm on 64-bit Windows 8. I used the following code 
>>>> > to read the CSV into Julia:
>>>> >
>>>> > using DataFrames
>>>> > train = readtable("C:\\train.csv")
>>>> >
>>>> > Next I tried to load the same file in Python:
>>>> >
>>>> > import pandas as pd
>>>> > train = pd.read_csv("C:\\train.csv")
>>>> >
>>>> > This took ~2.4 GB of memory and about a minute.
>>>> >
>>>> > Checking the same in R again:
>>>> > df = read.csv('E:/Libraries/train.csv', as.is = T)
>>>> >
>>>> > This took 2-3 minutes and consumed 3.5 GB of memory on the same machine.
>>>> >
>>>> > Why is there such a discrepancy, and why does Julia fail to load the CSV 
>>>> > before running out of memory? Is there a better way to load the file in 
>>>> > Julia?
>>>> >
>>>> >
>>>>
>>>
>
