I'm hesitant to suggest it, but if you're in a bind, I have an experimental
package for fast CSV reading. The API has stabilized somewhat over the last
week and I'm planning a broader release soon, but I'd still consider it
alpha quality. That said, if anyone's willing to give it a spin, you just
need to:
Pkg.add("Libz")
Pkg.add("NullableArrays")
Pkg.clone("https://github.com/quinnj/DataStreams.jl")
Pkg.clone("https://github.com/quinnj/CSV.jl")
With the original file referenced here I get:
julia> reload("CSV")
julia> f = CSV.Source("/Users/jacobquinn/Downloads/train.csv";null="NA")
CSV.Source: "/Users/jacobquinn/Downloads/train.csv"
delim: ','
quotechar: '"'
escapechar: '\\'
null: "NA"
schema:
DataStreams.Schema(UTF8String["ID","VAR_0001","VAR_0002","VAR_0003","VAR_0004","VAR_0005","VAR_0006","VAR_0007","VAR_0008","VAR_0009"
…
"VAR_1926","VAR_1927","VAR_1928","VAR_1929","VAR_1930","VAR_1931","VAR_1932","VAR_1933","VAR_1934","target"],[Int64,DataStreams.PointerString,Int64,Int64,Int64,DataStreams.PointerString,Int64,Int64,DataStreams.PointerString,DataStreams.PointerString
…
Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,DataStreams.PointerString,Int64],145231,1934)
dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
julia> @time ds = DataStreams.DataTable(f)
43.513800 seconds (694.00 M allocations: 12.775 GB, 2.55% gc time)
You can convert the result to a DataFrame with:
function DataFrames.DataFrame(dt::DataStreams.DataTable)
    cols = dt.schema.cols
    data = Array(Any, cols)
    types = DataStreams.types(dt)
    # pull each column out of the DataTable at its detected type
    for i = 1:cols
        data[i] = DataStreams.column(dt, i, types[i])
    end
    return DataFrame(data, Symbol[symbol(x) for x in dt.schema.header])
end
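With that method defined, the round trip looks like this on a small file (just a sketch — the toy CSV contents here are made up, and this assumes the alpha packages above are installed):

```julia
using CSV, DataStreams, DataFrames

# write a tiny CSV so the example is self-contained (hypothetical data)
path = tempname()
open(path, "w") do io
    write(io, "a,b\n1,x\n2,NA\n")
end

f  = CSV.Source(path; null="NA")   # detect the schema, treating "NA" as null
dt = DataStreams.DataTable(f)      # materialize the whole file
df = DataFrame(dt)                 # uses the constructor defined above
```

For the big file, the same `DataFrame(ds)` call on the `ds` from the timing run above should give you the full 145231x1934 table.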
-Jacob
On Tue, Oct 13, 2015 at 2:40 PM, feza <[email protected]> wrote:
> Finally was able to load it, but the process consumes a ton of memory.
> julia> @time train = readtable("./test.csv");
> 124.575362 seconds (376.11 M allocations: 13.438 GB, 10.77% gc time)
>
> On Tuesday, October 13, 2015 at 4:34:05 PM UTC-4, feza wrote:
>>
>> Same here on a 12gb ram machine
>>
>>                _
>>    _       _ _(_)_     |  A fresh approach to technical computing
>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>    _ _   _| |_  __ _   |  Type "?help" for help.
>>   | | | | | | |/ _` |  |
>>   | | |_| | | | (_| |  |  Version 0.5.0-dev+429 (2015-09-29 09:47 UTC)
>>  _/ |\__'_|_|_|\__'_|  |  Commit f71e449 (14 days old master)
>> |__/                   |  x86_64-w64-mingw32
>>
>> julia> using DataFrames
>>
>> julia> train = readtable("./test.csv");
>> ERROR: OutOfMemoryError()
>>  in resize! at array.jl:452
>>  in readnrows! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:164
>>  in readtable! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:767
>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:847
>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:893
>>
>> On Tuesday, October 13, 2015 at 3:47:58 PM UTC-4, Yichao Yu wrote:
>>>
>>>
>>> On Oct 13, 2015 2:47 PM, "Grey Marsh" <[email protected]> wrote:
>>>
>>> Which Julia version are you using? There's some gc tweak on 0.4 for that.
>>>
>>> >
>>> > I was trying to load the training dataset from springleaf marketing
>>> response on Kaggle. The csv is 921 MB and has 145321 rows and 1934 columns.
>>> My machine has 8 GB of RAM, and Julia had eaten 5.8 GB+ of memory before I
>>> stopped it, as there was barely any memory left for the OS to function
>>> properly. The incomplete operation took about 5-6 minutes. I'm on Windows 8
>>> 64-bit. I used the following code to read the csv into Julia:
>>> >
>>> > using DataFrames
>>> > train = readtable("C:\\train.csv")
>>> >
>>> > Next I tried to to load the same file in python:
>>> >
>>> > import pandas as pd
>>> > train = pd.read_csv("C:\\train.csv")
>>> >
>>> > This took ~2.4 GB of memory and about a minute.
>>> >
>>> > Checking the same in R again:
>>> > df = read.csv('E:/Libraries/train.csv', as.is = T)
>>> >
>>> > This took 2-3 minutes and consumed 3.5 GB of memory on the same machine.
>>> >
>>> > Why such a discrepancy, and why does Julia run out of memory before it
>>> can even load the csv? Is there a better way to get the file loaded in
>>> Julia?