Done with the testing in the cloud instance. It works, and the timings in my case:

58.346345 seconds (694.00 M allocations: 12.775 GB, 2.63% gc time)

Result of the "top" command: VIRT: 11.651g, RES: 3.579g. That's ~13 GB of memory for a 900 MB file! Thanks to Jacob; at least I was able to check that the process works.

On Wednesday, October 14, 2015 at 12:10:02 PM UTC+5:30, bernhard wrote:
>
> Jacob
>
> I run into the same issue as Grey. The step
>     ds = DataStreams.DataTable(f);
> gets stuck.
> I also tried this with a smaller file (150 MB) which I have. This file is
> read by readtable in 15 s, but the DataTable function freezes. I use 0.4 on
> Windows 7.
>
> I note that your code did work on a tiny file, though (40 lines or so).
> I do get a DataFrame, but when I show it (by simply typing df, or
> dump(df)), Julia crashes...
>
> Bernhard
>
> On Wednesday, October 14, 2015 at 06:54:16 UTC+2, Grey Marsh wrote:
>>
>> I am using Julia 0.4 for this purpose, if that's what is meant by "0.4
>> only".
>>
>> On Wednesday, October 14, 2015 at 9:53:09 AM UTC+5:30, Jacob Quinn wrote:
>>>
>>> Oh yes, I forgot to mention that the CSV/DataStreams code is 0.4 only.
>>> Definitely interested to hear about any results/experiences, though.
>>>
>>> -Jacob
>>>
>>> On Tue, Oct 13, 2015 at 10:11 PM, Yichao Yu <[email protected]> wrote:
>>>
>>>> On Wed, Oct 14, 2015 at 12:02 AM, Grey Marsh <[email protected]> wrote:
>>>> > @Jacob, I tried your approach. Somehow it got stuck on the "@time ds =
>>>> > DataStreams.DataTable(f)" line. After 15 minutes of running, Julia is
>>>> > using ~500 MB and one CPU core with no sign of finishing. Memory use
>>>> > has stayed almost the same for the whole 15 minutes. I'm letting it
>>>> > run, hoping it finishes after some time.
>>>> >
>>>> > From your run, I can see it needs 12 GB of memory, which is more than
>>>> > my machine's 8 GB. Could that be the problem?
>>>>
>>>> 12 GB is the total amount of memory ever allocated during the timing. A
>>>> lot of it might be intermediate results that are freed by the GC.
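Yichao's point, that `@time`'s figure is a running total of everything allocated during the call rather than the peak footprint, can be seen in a tiny sketch (the function name and sizes here are arbitrary):

```julia
# A minimal illustration that @time reports *cumulative* allocation:
# every temporary counts, even ones the GC frees almost immediately.
function churn(n)
    s = 0.0
    for _ in 1:n
        s += sum(rand(1000))   # each rand(1000) allocates a fresh ~8 KB vector
    end
    return s
end

churn(1)            # warm up, so compilation isn't included in the timing
@time churn(10000)  # total allocation is on the order of 80 MB,
                    # yet peak live memory stays tiny
```

So a 12.775 GB allocation total is compatible with a much smaller resident set, which is what the VIRT/RES numbers from top actually measure.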
>>>> Also, from the output of @time, it looks like 0.4.
>>>>
>>>> > On Wednesday, October 14, 2015 at 2:28:09 AM UTC+5:30, Jacob Quinn wrote:
>>>> >>
>>>> >> I'm hesitant to suggest it, but if you're in a bind, I have an
>>>> >> experimental package for fast CSV reading. The API has stabilized
>>>> >> somewhat over the last week and I'm planning a broader release soon,
>>>> >> but I'd still consider it alpha. That said, if anyone's willing to
>>>> >> give it a test drive, you just need to:
>>>> >>
>>>> >> Pkg.add("Libz")
>>>> >> Pkg.add("NullableArrays")
>>>> >> Pkg.clone("https://github.com/quinnj/DataStreams.jl")
>>>> >> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>> >>
>>>> >> With the original file referenced here I get:
>>>> >>
>>>> >> julia> reload("CSV")
>>>> >>
>>>> >> julia> f = CSV.Source("/Users/jacobquinn/Downloads/train.csv"; null="NA")
>>>> >> CSV.Source: "/Users/jacobquinn/Downloads/train.csv"
>>>> >> delim: ','
>>>> >> quotechar: '"'
>>>> >> escapechar: '\\'
>>>> >> null: "NA"
>>>> >> schema: DataStreams.Schema(UTF8String["ID","VAR_0001","VAR_0002","VAR_0003","VAR_0004","VAR_0005","VAR_0006","VAR_0007","VAR_0008","VAR_0009" … "VAR_1926","VAR_1927","VAR_1928","VAR_1929","VAR_1930","VAR_1931","VAR_1932","VAR_1933","VAR_1934","target"],[Int64,DataStreams.PointerString,Int64,Int64,Int64,DataStreams.PointerString,Int64,Int64,DataStreams.PointerString,DataStreams.PointerString … Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,DataStreams.PointerString,Int64],145231,1934)
>>>> >> dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
>>>> >>
>>>> >> julia> @time ds = DataStreams.DataTable(f)
>>>> >> 43.513800 seconds (694.00 M allocations: 12.775 GB, 2.55% gc time)
>>>> >>
>>>> >> You can convert the result to a DataFrame with:
>>>> >>
>>>> >> function DataFrames.DataFrame(dt::DataStreams.DataTable)
>>>> >>     cols = dt.schema.cols
>>>> >>     data = Array(Any, cols)
>>>> >>     types = DataStreams.types(dt)
>>>> >>     for i = 1:cols
>>>> >>         data[i] = DataStreams.column(dt, i, types[i])
>>>> >>     end
>>>> >>     return DataFrame(data, Symbol[symbol(x) for x in dt.schema.header])
>>>> >> end
>>>> >>
>>>> >> -Jacob
>>>> >>
>>>> >> On Tue, Oct 13, 2015 at 2:40 PM, feza <[email protected]> wrote:
>>>> >>>
>>>> >>> I finally was able to load it, but the process consumes a ton of
>>>> >>> memory.
>>>> >>>
>>>> >>> julia> @time train = readtable("./test.csv");
>>>> >>> 124.575362 seconds (376.11 M allocations: 13.438 GB, 10.77% gc time)
>>>> >>>
>>>> >>> On Tuesday, October 13, 2015 at 4:34:05 PM UTC-4, feza wrote:
>>>> >>>>
>>>> >>>> Same here on a 12 GB RAM machine:
>>>> >>>>
>>>> >>>>                _
>>>> >>>>    _       _ _(_)_     |  A fresh approach to technical computing
>>>> >>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>>> >>>>    _ _   _| |_  __ _   |  Type "?help" for help.
>>>> >>>>   | | | | | | |/ _` |  |
>>>> >>>>   | | |_| | | | (_| |  |  Version 0.5.0-dev+429 (2015-09-29 09:47 UTC)
>>>> >>>>  _/ |\__'_|_|_|\__'_|  |  Commit f71e449 (14 days old master)
>>>> >>>> |__/                   |  x86_64-w64-mingw32
>>>> >>>>
>>>> >>>> julia> using DataFrames
>>>> >>>>
>>>> >>>> julia> train = readtable("./test.csv");
>>>> >>>> ERROR: OutOfMemoryError()
>>>> >>>>  in resize! at array.jl:452
>>>> >>>>  in readnrows! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:164
>>>> >>>>  in readtable! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:767
>>>> >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:847
>>>> >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:893
>>>> >>>>
>>>> >>>> On Tuesday, October 13, 2015 at 3:47:58 PM UTC-4, Yichao Yu wrote:
>>>> >>>>>
>>>> >>>>> On Oct 13, 2015 2:47 PM, "Grey Marsh" <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>> Which Julia version are you using? There's some GC tweak on 0.4 for
>>>> >>>>> that.
>>>> >>>>>
>>>> >>>>> > I was trying to load the training dataset from the Springleaf
>>>> >>>>> > Marketing Response competition on Kaggle. The CSV is 921 MB, with
>>>> >>>>> > 145321 rows and 1934 columns. My machine has 8 GB of RAM, and
>>>> >>>>> > Julia ate 5.8+ GB of memory, after which I stopped Julia as there
>>>> >>>>> > was barely any memory left for the OS to function properly. The
>>>> >>>>> > incomplete operation had taken about 5-6 minutes by then. I'm on
>>>> >>>>> > Windows 8, 64-bit. I used the following code to read the CSV into
>>>> >>>>> > Julia:
>>>> >>>>> >
>>>> >>>>> > using DataFrames
>>>> >>>>> > train = readtable("C:\\train.csv")
>>>> >>>>> >
>>>> >>>>> > Next I tried to load the same file in Python:
>>>> >>>>> >
>>>> >>>>> > import pandas as pd
>>>> >>>>> > train = pd.read_csv("C:\\train.csv")
>>>> >>>>> >
>>>> >>>>> > This took ~2.4 GB of memory and about a minute.
>>>> >>>>> >
>>>> >>>>> > Checking the same in R:
>>>> >>>>> >
>>>> >>>>> > df = read.csv('E:/Libraries/train.csv', as.is = T)
>>>> >>>>> >
>>>> >>>>> > This took 2-3 minutes and consumed 3.5 GB of memory on the same
>>>> >>>>> > machine.
>>>> >>>>> >
>>>> >>>>> > Why such a discrepancy, and why does Julia even fail to load the
>>>> >>>>> > CSV before running out of memory? Is there a better way to get
>>>> >>>>> > the file loaded in Julia?
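For anyone hitting the same wall with readtable itself: a minimal sketch, assuming the 2015-era DataFrames.jl readtable keywords `nastrings` and `eltypes` (readtable was the standard reader at the time and was later deprecated in favor of CSV.jl; the tiny file here is fabricated for illustration):

```julia
using DataFrames

# Write a tiny throwaway CSV so the example is self-contained.
path, io = mktemp()
write(io, "ID,VAR_0001,target\n1,H,0\n2,NA,1\n")
close(io)

# `nastrings` tells the parser which tokens mean missing; pre-declaring
# `eltypes` (one entry per column) avoids some of the per-column
# type-inference churn that the allocation counts in this thread reflect.
train = readtable(path;
                  nastrings = ["NA"],
                  eltypes   = [Int64, UTF8String, Int64])

size(train)
```

Declaring types up front cannot shrink the data itself, but it does spare the reader from re-parsing columns whose inferred type changes partway through the file.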

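A back-of-the-envelope check on the allocation counts quoted in this thread (the per-cell reading is a plausible interpretation, not something stated by the posters):

```julia
# 145,231 rows x 1,934 columns gives roughly 281 M cells, the same order
# of magnitude as the 376.11 M (readtable) and 694.00 M (DataStreams)
# allocation counts reported above; when column types aren't known up
# front, each parsed cell tends to cost at least one small heap
# allocation (a substring, then a boxed value).
ncells = 145231 * 1934
println(ncells)   # 280876754, i.e. ~2.8e8 cells
```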