Re: [julia-users] 900mb csv loading in Julia failed: memory comparison vs python pandas and R

bernhard Mon, 26 Oct 2015 23:11:13 -0700

Thanks. I appreciate your efforts.
Looking forward to 0.4.1 in that case.


On Tuesday, 27 October 2015 06:30:32 UTC+1, Jacob Quinn wrote:
>
> Just a quick follow-up here: after some benchmarking of my own on a 
> Windows machine, the culprit ended up being a deathly slow `strtod` system 
> library function on Windows. It takes jumping through a few hoops to get 
> the performance right, which I discovered Base Julia already does; the 
> code just wasn't exported.
>
> My PR to Base Julia <https://github.com/JuliaLang/julia/pull/13641> has 
> been accepted and is pending backport, so once Julia 0.4.1 is released, 
> CSV.jl will be updated to use the new code and will require that version of 
> Julia to get similarly good performance across platforms.
>
> -Jacob
>
> On Wed, Oct 14, 2015 at 3:51 AM, bernhard <[email protected]> wrote:
>
>> With readtable the Julia process goes up to 6.3 GB and stays there. It 
>> takes 95 seconds (@time shows "373 M allocations: 13 GB, 7% GC time").
>> I will try Jacob's approach again.
>>
>>
>> On Wednesday, 14 October 2015 10:59:06 UTC+2, Milan Bouchet-Valat wrote:
>>>
>>> On Wednesday, 14 October 2015 at 00:15 -0700, Grey Marsh wrote: 
>>> > Done with the testing in the cloud instance. 
>>> > It works, and the timings in my case are 
>>> > 
>>> > 58.346345 seconds (694.00 M allocations: 12.775 GB, 2.63% gc time) 
>>> > 
>>> > Result of the "top" command:  VIRT: 11.651g RES: 3.579g 
>>> > 
>>> > ~13gb memory for a 900mb file! 
>>> > Thanks to Jacob, at least I was able to check that the process works. 
>>> As Yichao noted, at no point in the import did Julia use 13GB of RAM. 
>>> That's the total amount of memory that was allocated and freed in 
>>> pieces (694M of them). You'd need to watch the Julia process while 
>>> it is working to see the maximum value of RES during the import. 
>>>
>>>
>>> Regards 
>>>
>>> > On Wednesday, October 14, 2015 at 12:10:02 PM UTC+5:30, bernhard wrote: 
>>> > > Jacob 
>>> > > 
>>> > > I run into the same issue as Grey: the step 
>>> > > ds = DataStreams.DataTable(f); 
>>> > > gets stuck. 
>>> > > I also tried this with a smaller file (150MB) which I have. This 
>>> > > file is read by readtable in 15s, but the DataTable function 
>>> > > freezes. I use 0.4 on Windows 7. 
>>> > > 
>>> > > I note that your code did work on a tiny file though (40 lines or 
>>> > > so). 
>>> > > I do get a dataframe, but when I show it (by simply typing df, or 
>>> > > dump(df)) Julia crashes... 
>>> > > 
>>> > > Bernhard 
>>> > > 
>>> > > 
>>> > > On Wednesday, 14 October 2015 06:54:16 UTC+2, Grey Marsh wrote: 
>>> > > > I am using Julia 0.4 for this purpose, if that's what is meant by 
>>> > > > "0.4 only". 
>>> > > > 
>>> > > > On Wednesday, October 14, 2015 at 9:53:09 AM UTC+5:30, Jacob Quinn wrote: 
>>> > > > > Oh yes, I forgot to mention that the CSV/DataStreams code is 
>>> > > > > 0.4 only. Definitely interested to hear about any 
>>> > > > > results/experiences though. 
>>> > > > > 
>>> > > > > -Jacob 
>>> > > > > 
>>> > > > > On Tue, Oct 13, 2015 at 10:11 PM, Yichao Yu <[email protected]> wrote: 
>>> > > > > > On Wed, Oct 14, 2015 at 12:02 AM, Grey Marsh <[email protected]> wrote: 
>>> > > > > > > @Jacob, I tried your approach. Somehow it got stuck on the 
>>> > > > > > > "@time ds = DataStreams.DataTable(f)" line. After running for 15 
>>> > > > > > > minutes, julia is using ~500mb and 1 cpu core with no sign of 
>>> > > > > > > finishing. The memory use has been almost the same for the whole 
>>> > > > > > > 15 minutes. I'm letting it run, hoping that it finishes after 
>>> > > > > > > some time. 
>>> > > > > > > 
>>> > > > > > > From your run, I can see it needs 12gb of memory, which is more 
>>> > > > > > > than my machine's 8gb. Could that be the problem? 
>>> > > > > > 
>>> > > > > > 12GB is the total amount of memory ever allocated during the 
>>> > > > > > timing. A lot of it might be intermediate results that are freed 
>>> > > > > > by the GC. 
>>> > > > > > Also, from the output of @time, it looks like 0.4. 
>>> > > > > > 
>>> > > > > > > 
>>> > > > > > > On Wednesday, October 14, 2015 at 2:28:09 AM UTC+5:30, Jacob Quinn wrote: 
>>> > > > > > >> 
>>> > > > > > >> I'm hesitant to suggest it, but if you're in a bind, I have an 
>>> > > > > > >> experimental package for fast CSV reading. The API has stabilized 
>>> > > > > > >> somewhat over the last week and I'm planning a broader release 
>>> > > > > > >> soon, but I'd still consider it alpha. That said, if anyone's 
>>> > > > > > >> willing to give it a test drive, you just need to 
>>> > > > > > >> 
>>> > > > > > >> Pkg.add("Libz") 
>>> > > > > > >> Pkg.add("NullableArrays") 
>>> > > > > > >> Pkg.clone("https://github.com/quinnj/DataStreams.jl") 
>>> > > > > > >> Pkg.clone("https://github.com/quinnj/CSV.jl") 
>>> > > > > > >> 
>>> > > > > > >> With the original file referenced here I get: 
>>> > > > > > >> 
>>> > > > > > >> julia> reload("CSV") 
>>> > > > > > >> 
>>> > > > > > >> julia> f = CSV.Source("/Users/jacobquinn/Downloads/train.csv";null="NA") 
>>> > > > > > >> CSV.Source: "/Users/jacobquinn/Downloads/train.csv" 
>>> > > > > > >> delim: ',' 
>>> > > > > > >> quotechar: '"' 
>>> > > > > > >> escapechar: '\\' 
>>> > > > > > >> null: "NA" 
>>> > > > > > >> schema: 
>>> > > > > > >> DataStreams.Schema(UTF8String["ID","VAR_0001","VAR_0002","VAR_0003","VAR_0004","VAR_0005","VAR_0006","VAR_0007","VAR_0008","VAR_0009" 
>>> > > > > > >> … 
>>> > > > > > >> "VAR_1926","VAR_1927","VAR_1928","VAR_1929","VAR_1930","VAR_1931","VAR_1932","VAR_1933","VAR_1934","target"],[Int64,DataStreams.PointerString,Int64,Int64,Int64,DataStreams.PointerString,Int64,Int64,DataStreams.PointerString,DataStreams.PointerString 
>>> > > > > > >> … 
>>> > > > > > >> Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,DataStreams.PointerString,Int64],145231,1934) 
>>> > > > > > >> dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english") 
>>> > > > > > >> 
>>> > > > > > >> 
>>> > > > > > >> julia> @time ds = DataStreams.DataTable(f) 
>>> > > > > > >>  43.513800 seconds (694.00 M allocations: 12.775 GB, 2.55% gc time) 
>>> > > > > > >> 
>>> > > > > > >> 
>>> > > > > > >> You can convert the result to a DataFrame with: 
>>> > > > > > >> 
>>> > > > > > >> function DataFrames.DataFrame(dt::DataStreams.DataTable) 
>>> > > > > > >>     cols = dt.schema.cols 
>>> > > > > > >>     data = Array(Any,cols) 
>>> > > > > > >>     types = DataStreams.types(dt) 
>>> > > > > > >>     for i = 1:cols 
>>> > > > > > >>         data[i] = DataStreams.column(dt,i,types[i]) 
>>> > > > > > >>     end 
>>> > > > > > >>     return DataFrame(data,Symbol[symbol(x) for x in dt.schema.header]) 
>>> > > > > > >> end 
>>> > > > > > >> 
>>> > > > > > >> 
>>> > > > > > >> -Jacob 
>>> > > > > > >> 
>>> > > > > > >> On Tue, Oct 13, 2015 at 2:40 PM, feza <[email protected]> wrote: 
>>> > > > > > >>> 
>>> > > > > > >>> Finally was able to load it, but the process consumes a ton of memory. 
>>> > > > > > >>> julia> @time train = readtable("./test.csv"); 
>>> > > > > > >>> 124.575362 seconds (376.11 M allocations: 13.438 GB, 10.77% gc time) 
>>> > > > > > >>> 
>>> > > > > > >>> 
>>> > > > > > >>> 
>>> > > > > > >>> On Tuesday, October 13, 2015 at 4:34:05 PM UTC-4, feza wrote: 
>>> > > > > > >>>> 
>>> > > > > > >>>> Same here on a 12gb ram machine 
>>> > > > > > >>>> 
>>> > > > > > >>>>                _ 
>>> > > > > > >>>>    _       _ _(_)_     |  A fresh approach to technical computing 
>>> > > > > > >>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org 
>>> > > > > > >>>>    _ _   _| |_  __ _   |  Type "?help" for help. 
>>> > > > > > >>>>   | | | | | | |/ _` |  | 
>>> > > > > > >>>>   | | |_| | | | (_| |  |  Version 0.5.0-dev+429 (2015-09-29 09:47 UTC) 
>>> > > > > > >>>>  _/ |\__'_|_|_|\__'_|  |  Commit f71e449 (14 days old master) 
>>> > > > > > >>>> |__/                   |  x86_64-w64-mingw32 
>>> > > > > > >>>> 
>>> > > > > > >>>> julia> using DataFrames 
>>> > > > > > >>>> 
>>> > > > > > >>>> julia> train = readtable("./test.csv"); 
>>> > > > > > >>>> ERROR: OutOfMemoryError() 
>>> > > > > > >>>>  in resize! at array.jl:452 
>>> > > > > > >>>>  in readnrows! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:164 
>>> > > > > > >>>>  in readtable! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:767 
>>> > > > > > >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:847 
>>> > > > > > >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:893 
>>> > > > > > >>>> 
>>> > > > > > >>>> 
>>> > > > > > >>>> 
>>> > > > > > >>>> 
>>> > > > > > >>>> 
>>> > > > > > >>>> On Tuesday, October 13, 2015 at 3:47:58 PM UTC-4, Yichao Yu wrote: 
>>> > > > > > >>>>> 
>>> > > > > > >>>>> 
>>> > > > > > >>>>> On Oct 13, 2015 2:47 PM, "Grey Marsh" <[email protected]> wrote: 
>>> > > > > > >>>>> 
>>> > > > > > >>>>> Which Julia version are you using? There's some GC tweak on 0.4 
>>> > > > > > >>>>> for that. 
>>> > > > > > >>>>> 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > I was trying to load the training dataset from the Springleaf 
>>> > > > > > >>>>> > Marketing Response competition on Kaggle. The csv is 921 mb and 
>>> > > > > > >>>>> > has 145231 rows and 1934 columns. My machine has 8 gb of ram, 
>>> > > > > > >>>>> > and julia ate 5.8gb+ of memory before I stopped it, as there 
>>> > > > > > >>>>> > was barely any memory left for the OS to function properly. By 
>>> > > > > > >>>>> > then the incomplete operation had been running for about 5-6 
>>> > > > > > >>>>> > minutes. I have 64-bit Windows 8. 
>>> > > > > > >>>>> > I used the following code to read the csv into Julia: 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > using DataFrames 
>>> > > > > > >>>>> > train = readtable("C:\\train.csv") 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > Next I tried to load the same file in python: 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > import pandas as pd 
>>> > > > > > >>>>> > train = pd.read_csv("C:\\train.csv") 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > This took ~2.4gb of memory and about a minute. 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > Checking the same in R: 
>>> > > > > > >>>>> > df = read.csv('E:/Libraries/train.csv', as.is = T) 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > This took 2-3 minutes and consumed 3.5gb of memory on the same 
>>> > > > > > >>>>> > machine. 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > Why such a discrepancy, and why does Julia fail to load the csv 
>>> > > > > > >>>>> > before running out of memory? Is there a better way to get the 
>>> > > > > > >>>>> > file loaded in Julia? 
>>> > > > > > >>>>> > 
>>> > > > > > >>>>> > 
>>> > > > > > >> 
>>> > > > > > >> 
>>> > > > > > > 
>>> > > > > > 
>>> > > > > 
>>>
>>
>
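
A note on reading the @time figures quoted above: the "12.775 GB" is the cumulative amount allocated over the whole run, not the amount held at any one moment. Below is a minimal sketch of that distinction, written in Python (which the thread also uses for the pandas comparison) rather than Julia. The function name `churn`, the chunk size and the chunk count are made up for illustration, and `tracemalloc` only tracks allocations made through Python's allocator, so treat this as an analogy to @time versus top's RES column, not as a measurement of either.

```python
import tracemalloc


def churn(n_chunks, chunk_bytes):
    """Allocate and immediately drop n_chunks buffers.

    Returns the cumulative number of bytes requested, which is the
    analogue of the "allocations" total reported by Julia's @time.
    """
    total_requested = 0
    for _ in range(n_chunks):
        buf = bytearray(chunk_bytes)  # temporary buffer, like parser scratch space
        total_requested += len(buf)
        del buf                       # released before the next buffer is created
    return total_requested


tracemalloc.start()
total_requested = churn(n_chunks=200, chunk_bytes=1_000_000)
# get_traced_memory() returns (current, peak) traced sizes in bytes.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"cumulative bytes requested: {total_requested:,}")
print(f"peak bytes live at once:    {peak:,}")
```

The cumulative figure grows with every temporary buffer, while the peak stays near the size of a single buffer. That is why a 12-13 GB allocation total can coexist with a resident size of a few GB.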
