Done with the testing in the cloud instance. It works, and the timings in my case:

58.346345 seconds (694.00 M allocations: 12.775 GB, 2.63% gc time)

Result of the "top" command: VIRT: 11.651g, RES: 3.579g. That's ~13 GB of memory for a 900 MB file! Thanks to Jacob; at least I was able to check that the process works.

On Wednesday, October 14, 2015 at 12:10:02 PM UTC+5:30, bernhard wrote:
>
> Jacob
>
> I run into the same issue as Grey. The step
>     ds = DataStreams.DataTable(f);
> gets stuck.
> I also tried this with a smaller file (150 MB) which I have. This file is
> read by readtable in 15 s, but the DataTable function freezes. I use 0.4 on
> Windows 7.
>
> I note that your code did work on a tiny file, though (40 lines or so).
> I do get a DataFrame, but when I show it (by simply typing df, or
> dump(df)), Julia crashes...
>
> Bernhard
>
> On Wednesday, October 14, 2015 at 06:54:16 UTC+2, Grey Marsh wrote:
>>
>> I am using Julia 0.4 for this purpose, if that's what is meant by "0.4
>> only".
>>
>> On Wednesday, October 14, 2015 at 9:53:09 AM UTC+5:30, Jacob Quinn wrote:
>>>
>>> Oh yes, I forgot to mention that the CSV/DataStreams code is 0.4 only.
>>> Definitely interested to hear about any results/experiences, though.
>>>
>>> -Jacob
>>>
>>> On Tue, Oct 13, 2015 at 10:11 PM, Yichao Yu <[email protected]> wrote:
>>>
>>>> On Wed, Oct 14, 2015 at 12:02 AM, Grey Marsh <[email protected]> wrote:
>>>> > @Jacob, I tried your approach. Somehow it got stuck on the "@time ds =
>>>> > DataStreams.DataTable(f)" line. After 15 minutes of running, Julia is
>>>> > using ~500 MB and one CPU core with no sign of finishing. Memory use
>>>> > has stayed almost the same for the whole 15 minutes. I'm letting it
>>>> > run, hoping it finishes after some time.
>>>> >
>>>> > From your run, I can see it needs 12 GB of memory, which is more than
>>>> > my machine's 8 GB. Could that be the problem?
>>>>
>>>> 12 GB is the total amount of memory ever allocated during the timing. A
>>>> lot of it might be intermediate results that are freed by the GC.
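Yichao's point, that `@time`'s figure is a running total of everything allocated during the call rather than the peak footprint, can be seen in a tiny sketch (the function name and sizes here are arbitrary):

```julia
# A minimal illustration that @time reports *cumulative* allocation:
# every temporary counts, even ones the GC frees almost immediately.
function churn(n)
    s = 0.0
    for _ in 1:n
        s += sum(rand(1000))   # each rand(1000) allocates a fresh ~8 KB vector
    end
    return s
end

churn(1)            # warm up, so compilation isn't included in the timing
@time churn(10000)  # total allocation is on the order of 80 MB,
                    # yet peak live memory stays tiny
```

So a 12.775 GB allocation total is compatible with a much smaller resident set, which is what the VIRT/RES numbers from top actually measure.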
>>>> Also, from the output of @time, it looks like 0.4.
>>>>
>>>> > On Wednesday, October 14, 2015 at 2:28:09 AM UTC+5:30, Jacob Quinn wrote:
>>>> >>
>>>> >> I'm hesitant to suggest it, but if you're in a bind, I have an
>>>> >> experimental package for fast CSV reading. The API has stabilized
>>>> >> somewhat over the last week and I'm planning a broader release soon,
>>>> >> but I'd still consider it alpha. That said, if anyone's willing to
>>>> >> give it a test drive, you just need to:
>>>> >>
>>>> >> Pkg.add("Libz")
>>>> >> Pkg.add("NullableArrays")
>>>> >> Pkg.clone("https://github.com/quinnj/DataStreams.jl")
>>>> >> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>> >>
>>>> >> With the original file referenced here I get:
>>>> >>
>>>> >> julia> reload("CSV")
>>>> >>
>>>> >> julia> f = CSV.Source("/Users/jacobquinn/Downloads/train.csv"; null="NA")
>>>> >> CSV.Source: "/Users/jacobquinn/Downloads/train.csv"
>>>> >> delim: ','
>>>> >> quotechar: '"'
>>>> >> escapechar: '\\'
>>>> >> null: "NA"
>>>> >> schema: DataStreams.Schema(UTF8String["ID","VAR_0001","VAR_0002","VAR_0003","VAR_0004","VAR_0005","VAR_0006","VAR_0007","VAR_0008","VAR_0009" … "VAR_1926","VAR_1927","VAR_1928","VAR_1929","VAR_1930","VAR_1931","VAR_1932","VAR_1933","VAR_1934","target"],[Int64,DataStreams.PointerString,Int64,Int64,Int64,DataStreams.PointerString,Int64,Int64,DataStreams.PointerString,DataStreams.PointerString … Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,DataStreams.PointerString,Int64],145231,1934)
>>>> >> dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
>>>> >>
>>>> >> julia> @time ds = DataStreams.DataTable(f)
>>>> >> 43.513800 seconds (694.00 M allocations: 12.775 GB, 2.55% gc time)
>>>> >>
>>>> >> You can convert the result to a DataFrame with:
>>>> >>
>>>> >> function DataFrames.DataFrame(dt::DataStreams.DataTable)
>>>> >>     cols = dt.schema.cols
>>>> >>     data = Array(Any, cols)
>>>> >>     types = DataStreams.types(dt)
>>>> >>     for i = 1:cols
>>>> >>         data[i] = DataStreams.column(dt, i, types[i])
>>>> >>     end
>>>> >>     return DataFrame(data, Symbol[symbol(x) for x in dt.schema.header])
>>>> >> end
>>>> >>
>>>> >> -Jacob
>>>> >>
>>>> >> On Tue, Oct 13, 2015 at 2:40 PM, feza <[email protected]> wrote:
>>>> >>>
>>>> >>> I finally was able to load it, but the process consumes a ton of
>>>> >>> memory.
>>>> >>>
>>>> >>> julia> @time train = readtable("./test.csv");
>>>> >>> 124.575362 seconds (376.11 M allocations: 13.438 GB, 10.77% gc time)
>>>> >>>
>>>> >>> On Tuesday, October 13, 2015 at 4:34:05 PM UTC-4, feza wrote:
>>>> >>>>
>>>> >>>> Same here on a 12 GB RAM machine:
>>>> >>>>
>>>> >>>>                _
>>>> >>>>    _       _ _(_)_     |  A fresh approach to technical computing
>>>> >>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>>> >>>>    _ _   _| |_  __ _   |  Type "?help" for help.
>>>> >>>>   | | | | | | |/ _` |  |
>>>> >>>>   | | |_| | | | (_| |  |  Version 0.5.0-dev+429 (2015-09-29 09:47 UTC)
>>>> >>>>  _/ |\__'_|_|_|\__'_|  |  Commit f71e449 (14 days old master)
>>>> >>>> |__/                   |  x86_64-w64-mingw32
>>>> >>>>
>>>> >>>> julia> using DataFrames
>>>> >>>>
>>>> >>>> julia> train = readtable("./test.csv");
>>>> >>>> ERROR: OutOfMemoryError()
>>>> >>>>  in resize! at array.jl:452
>>>> >>>>  in readnrows! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:164
>>>> >>>>  in readtable! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:767
>>>> >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:847
>>>> >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:893
>>>> >>>>
>>>> >>>> On Tuesday, October 13, 2015 at 3:47:58 PM UTC-4, Yichao Yu wrote:
>>>> >>>>>
>>>> >>>>> On Oct 13, 2015 2:47 PM, "Grey Marsh" <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>> Which Julia version are you using? There's some GC tweak on 0.4 for
>>>> >>>>> that.
>>>> >>>>>
>>>> >>>>> > I was trying to load the training dataset from the Springleaf
>>>> >>>>> > Marketing Response competition on Kaggle. The CSV is 921 MB, with
>>>> >>>>> > 145321 rows and 1934 columns. My machine has 8 GB of RAM, and
>>>> >>>>> > Julia ate 5.8+ GB of memory, after which I stopped Julia as there
>>>> >>>>> > was barely any memory left for the OS to function properly. The
>>>> >>>>> > incomplete operation had taken about 5-6 minutes by then. I'm on
>>>> >>>>> > Windows 8, 64-bit. I used the following code to read the CSV into
>>>> >>>>> > Julia:
>>>> >>>>> >
>>>> >>>>> > using DataFrames
>>>> >>>>> > train = readtable("C:\\train.csv")
>>>> >>>>> >
>>>> >>>>> > Next I tried to load the same file in Python:
>>>> >>>>> >
>>>> >>>>> > import pandas as pd
>>>> >>>>> > train = pd.read_csv("C:\\train.csv")
>>>> >>>>> >
>>>> >>>>> > This took ~2.4 GB of memory and about a minute.
>>>> >>>>> >
>>>> >>>>> > Checking the same in R:
>>>> >>>>> >
>>>> >>>>> > df = read.csv('E:/Libraries/train.csv', as.is = T)
>>>> >>>>> >
>>>> >>>>> > This took 2-3 minutes and consumed 3.5 GB of memory on the same
>>>> >>>>> > machine.
>>>> >>>>> >
>>>> >>>>> > Why such a discrepancy, and why does Julia even fail to load the
>>>> >>>>> > CSV before running out of memory? Is there a better way to get
>>>> >>>>> > the file loaded in Julia?
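For anyone hitting the same wall with readtable itself: a minimal sketch, assuming the 2015-era DataFrames.jl readtable keywords `nastrings` and `eltypes` (readtable was the standard reader at the time and was later deprecated in favor of CSV.jl; the tiny file here is fabricated for illustration):

```julia
using DataFrames

# Write a tiny throwaway CSV so the example is self-contained.
path, io = mktemp()
write(io, "ID,VAR_0001,target\n1,H,0\n2,NA,1\n")
close(io)

# `nastrings` tells the parser which tokens mean missing; pre-declaring
# `eltypes` (one entry per column) avoids some of the per-column
# type-inference churn that the allocation counts in this thread reflect.
train = readtable(path;
                  nastrings = ["NA"],
                  eltypes   = [Int64, UTF8String, Int64])

size(train)
```

Declaring types up front cannot shrink the data itself, but it does spare the reader from re-parsing columns whose inferred type changes partway through the file.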

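A back-of-the-envelope check on the allocation counts quoted in this thread (the per-cell reading is a plausible interpretation, not something stated by the posters):

```julia
# 145,231 rows x 1,934 columns gives roughly 281 M cells, the same order
# of magnitude as the 376.11 M (readtable) and 694.00 M (DataStreams)
# allocation counts reported above; when column types aren't known up
# front, each parsed cell tends to cost at least one small heap
# allocation (a substring, then a boxed value).
ncells = 145231 * 1934
println(ncells)   # 280876754, i.e. ~2.8e8 cells
```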