On Wed, Oct 14, 2015 at 12:02 AM, Grey Marsh <[email protected]> wrote:
> @Jacob, I tried your approach. Somehow it got stuck in the "@time ds =
> DataStreams.DataTable(f)" line. After 15 minutes running, julia is using
> ~500mb and 1 cpu core with no sign of end. The memory use has been almost
> same for the whole duration of 15 minutes. I'm letting it run, hoping that
> it finishes after some time.
>
> From your run, I can see it needs 12gb of memory, which is more than my
> machine's 8gb. Could that be the problem?
12GB is the total amount of memory allocated over the course of the timed
call, not the peak usage. Much of it is probably intermediate results that
the GC frees along the way. Also, judging from the output of @time, that run
looks like it was on 0.4.
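
To illustrate the difference between cumulative allocation and peak usage,
here is a minimal sketch. It is in Python rather than Julia, uses only the
standard-library tracemalloc module, and the function name `churn` and its
parameters are made up for this example. It says nothing about how
DataStreams allocates; it only shows that a loop creating and dropping
temporaries can allocate far more in total than it ever holds at once.

```python
import tracemalloc


def churn(rounds=200, size=100_000):
    """Allocate and immediately drop a temporary buffer on every round.

    Returns (cumulative_bytes, peak_bytes): the sum of all buffer sizes
    requested, and the peak traced memory reported by tracemalloc.
    """
    tracemalloc.start()
    cumulative = 0
    for _ in range(rounds):
        buf = bytearray(size)    # temporary buffer, like an intermediate result
        cumulative += len(buf)   # count it toward the running total
        del buf                  # released before the next round allocates
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return cumulative, peak


if __name__ == "__main__":
    cumulative, peak = churn()
    print("cumulative bytes requested:", cumulative)
    print("peak traced bytes:         ", peak)
```

The cumulative figure plays the role of the number @time reports, while the
peak is closer to what a process monitor would show.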
>
> On Wednesday, October 14, 2015 at 2:28:09 AM UTC+5:30, Jacob Quinn wrote:
>>
>> I'm hesitant to suggest it, but if you're in a bind, I have an experimental
>> package for fast CSV reading. The API has stabilized somewhat over the last
>> week and I'm planning a broader release soon, but I'd still consider it
>> alpha quality. That said, if anyone's willing to give it a test drive, you
>> just need to run
>>
>> Pkg.add("Libz")
>> Pkg.add("NullableArrays")
>> Pkg.clone("https://github.com/quinnj/DataStreams.jl")
>> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>
>> With the original file referenced here I get:
>>
>> julia> reload("CSV")
>>
>> julia> f = CSV.Source("/Users/jacobquinn/Downloads/train.csv";null="NA")
>> CSV.Source: "/Users/jacobquinn/Downloads/train.csv"
>> delim: ','
>> quotechar: '"'
>> escapechar: '\\'
>> null: "NA"
>> schema:
>> DataStreams.Schema(UTF8String["ID","VAR_0001","VAR_0002","VAR_0003","VAR_0004","VAR_0005","VAR_0006","VAR_0007","VAR_0008","VAR_0009"
>> …
>> "VAR_1926","VAR_1927","VAR_1928","VAR_1929","VAR_1930","VAR_1931","VAR_1932","VAR_1933","VAR_1934","target"],[Int64,DataStreams.PointerString,Int64,Int64,Int64,DataStreams.PointerString,Int64,Int64,DataStreams.PointerString,DataStreams.PointerString
>> …
>> Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,DataStreams.PointerString,Int64],145231,1934)
>> dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
>>
>>
>> julia> @time ds = DataStreams.DataTable(f)
>> 43.513800 seconds (694.00 M allocations: 12.775 GB, 2.55% gc time)
>>
>>
>> You can convert the result to a DataFrame with:
>>
>> function DataFrames.DataFrame(dt::DataStreams.DataTable)
>>     cols = dt.schema.cols
>>     data = Array(Any,cols)
>>     types = DataStreams.types(dt)
>>     for i = 1:cols
>>         data[i] = DataStreams.column(dt,i,types[i])
>>     end
>>     return DataFrame(data,Symbol[symbol(x) for x in dt.schema.header])
>> end
>>
>>
>> -Jacob
>>
>> On Tue, Oct 13, 2015 at 2:40 PM, feza <[email protected]> wrote:
>>>
>>> Finally was able to load it, but the process consumes a ton of memory.
>>> julia> @time train = readtable("./test.csv");
>>> 124.575362 seconds (376.11 M allocations: 13.438 GB, 10.77% gc time)
>>>
>>>
>>>
>>> On Tuesday, October 13, 2015 at 4:34:05 PM UTC-4, feza wrote:
>>>>
>>>> Same here on a 12gb ram machine
>>>>
>>>>                _
>>>>    _       _ _(_)_     |  A fresh approach to technical computing
>>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>>>    _ _   _| |_  __ _   |  Type "?help" for help.
>>>>   | | | | | | |/ _` |  |
>>>>   | | |_| | | | (_| |  |  Version 0.5.0-dev+429 (2015-09-29 09:47 UTC)
>>>>  _/ |\__'_|_|_|\__'_|  |  Commit f71e449 (14 days old master)
>>>> |__/                   |  x86_64-w64-mingw32
>>>>
>>>> julia> using DataFrames
>>>>
>>>> julia> train = readtable("./test.csv");
>>>> ERROR: OutOfMemoryError()
>>>>  in resize! at array.jl:452
>>>>  in readnrows! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:164
>>>>  in readtable! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:767
>>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:847
>>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:893
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tuesday, October 13, 2015 at 3:47:58 PM UTC-4, Yichao Yu wrote:
>>>>>
>>>>>
>>>>> On Oct 13, 2015 2:47 PM, "Grey Marsh" <[email protected]> wrote:
>>>>>
>>>>> Which Julia version are you using? There are some GC tweaks on 0.4 for
>>>>> that.
>>>>>
>>>>> >
>>>>> > I was trying to load the training dataset from Springleaf Marketing
>>>>> > Response on Kaggle. The csv is 921 mb and has 145321 rows and 1934
>>>>> > columns. My machine has 8 gb ram, and julia ate 5.8gb+ of memory before
>>>>> > I stopped it, as there was barely any memory left for the OS to
>>>>> > function properly. The incomplete operation had been running for about
>>>>> > 5-6 minutes by then. I'm on windows 8 64bit.
>>>>> > I used the following code to read the csv into Julia:
>>>>> >
>>>>> > using DataFrames
>>>>> > train = readtable("C:\\train.csv")
>>>>> >
>>>>> > Next I tried to load the same file in python:
>>>>> >
>>>>> > import pandas as pd
>>>>> > train = pd.read_csv("C:\\train.csv")
>>>>> >
>>>>> > This took ~2.4gb of memory and about a minute.
>>>>> >
>>>>> > Checking the same in R again:
>>>>> > df = read.csv('E:/Libraries/train.csv', as.is = T)
>>>>> >
>>>>> > This took 2-3 minutes and consumed 3.5gb of memory on the same machine.
>>>>> >
>>>>> > Why is there such a discrepancy, and why does Julia fail to load the
>>>>> > csv before running out of memory? Is there a better way to get the
>>>>> > file loaded in Julia?
>>>>> >
>>>>> >
>>
>>
>
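
As a rough sanity check on the memory figures in this thread, here is a
back-of-envelope estimate of what the parsed table might occupy. It assumes
every cell is stored as one 8-byte value, which is only an approximation
since the schema also lists string columns. The helper name
`dense_table_bytes` is made up for this example, and the row and column
counts are the ones printed in the CSV.Source schema above.

```python
def dense_table_bytes(rows, cols, bytes_per_cell=8):
    """Size of a dense rows x cols table with fixed-width cells, in bytes."""
    return rows * cols * bytes_per_cell


if __name__ == "__main__":
    # Counts taken from the schema output quoted in the thread; the 8-byte
    # cell width is an assumption (Int64 / Float64), not a measured value.
    rows, cols = 145231, 1934
    total = dense_table_bytes(rows, cols)
    print("cells:", rows * cols)
    print("bytes:", total)
    print("GiB:  ", round(total / 2**30, 2))
```

Under that assumption the dense table comes to a little over 2 GB, which is
in the same range as the ~2.4gb reported for pandas and well below the 12GB
cumulative allocation reported by @time.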