Re: [julia-users] 900mb csv loading in Julia failed: memory comparison vs python pandas and R

Jacob Quinn Mon, 26 Oct 2015 22:30:40 -0700

Just a quick follow-up here: after some benchmarking of my own on a Windows
machine, the culprit turned out to be a deathly slow `strtod` system library
function on Windows. Getting good performance takes jumping through a few
hoops, which I discovered Base Julia already does; the code just wasn't
exported.


My PR to Base Julia <https://github.com/JuliaLang/julia/pull/13641> has
been accepted and is pending backport, so once Julia 0.4.1 is released,
CSV.jl will be updated to use the new code and will require that version of
Julia, which should give similarly good performance across platforms.

-Jacob

On Wed, Oct 14, 2015 at 3:51 AM, bernhard wrote:

> With readtable, the Julia process goes up to 6.3 GB and stays there. It
> takes 95 seconds. (@time shows "373M allocations, 13GB, 7% GC time")
> I will try Jacob's approach again.
>
>
> On Wednesday, 14 October 2015 10:59:06 UTC+2, Milan Bouchet-Valat wrote:
>>
>> On Wednesday, 14 October 2015 at 00:15 -0700, Grey Marsh wrote:
>> > Done with the testing in the cloud instance.
>> > It works, and the timing in my case was:
>> >
>> > 58.346345 seconds (694.00 M allocations: 12.775 GB, 2.63% gc time)
>> >
>> > result of "top" command:  VIRT: 11.651g RES: 3.579g
>> >
>> > ~13 GB of memory for a 900 MB file!
>> > Thanks to Jacob, at least I was able to check that the process works.
>> As Yichao noted, at no point in the import did Julia use 13GB of RAM.
>> That's the total amount of memory that was allocated and freed by
>> pieces (694M of them). You'd need to watch the Julia process while
>> working to see what's the maximum value of RES when importing.
>>
>>
>> Regards
>>
>> > On Wednesday, October 14, 2015 at 12:10:02 PM UTC+5:30, bernhard
>> > wrote:
>> > > Jacob
>> > >
>> > > I do run into the same issue as Grey. The step
>> > > ds = DataStreams.DataTable(f);
>> > > gets stuck.
>> > > I also tried this with a smaller file (150MB) which I have. This
>> > > file is read by readtable in 15s. But the DataTable function
>> > > freezes. I use 0.4 on Windows 7.
>> > >
>> > > I note that your code did work on a tiny file though (40 lines or
>> > > so).
>> > > I do get a dataframe, but when I show it (by simply typing df, or
>> > > dump(df)) Julia crashes...
>> > >
>> > > Bernhard
>> > >
>> > >
>> > > On Wednesday, 14 October 2015 06:54:16 UTC+2, Grey Marsh wrote:
>> > > > I am using Julia 0.4 for this purpose, if that's what is meant by
>> > > > "0.4 only".
>> > > >
>> > > > On Wednesday, October 14, 2015 at 9:53:09 AM UTC+5:30, Jacob
>> > > > Quinn wrote:
>> > > > > Oh yes, I forgot to mention that the CSV/DataStreams code is
>> > > > > 0.4 only. Definitely interested to hear about any
>> > > > > results/experiences though.
>> > > > >
>> > > > > -Jacob
>> > > > >
>> > > > > On Tue, Oct 13, 2015 at 10:11 PM, Yichao Yu wrote:
>> > > > > > On Wed, Oct 14, 2015 at 12:02 AM, Grey Marsh wrote:
>> > > > > > > @Jacob, I tried your approach. Somehow it got stuck at the
>> > > > > > > "@time ds = DataStreams.DataTable(f)" line. After running for 15
>> > > > > > > minutes, Julia is using ~500 MB and 1 CPU core with no sign of
>> > > > > > > finishing. Memory use has been almost constant for the whole 15
>> > > > > > > minutes. I'm letting it run, hoping that it finishes after some time.
>> > > > > > >
>> > > > > > > From your run, I can see it needs 12 GB of memory, which is more
>> > > > > > > than my machine's 8 GB. Could that be the problem?
>> > > > > >
>> > > > > > 12GB is the total amount of memory ever allocated during the
>> > > > > > timing. Much of it may be intermediate results that are freed by
>> > > > > > the GC. Also, from the output of @time, it looks like 0.4.
>> > > > > >
>> > > > > > >
>> > > > > > > On Wednesday, October 14, 2015 at 2:28:09 AM UTC+5:30, Jacob Quinn wrote:
>> > > > > > >>
>> > > > > > >> I'm hesitant to suggest it, but if you're in a bind, I have an
>> > > > > > >> experimental package for fast CSV reading. The API has stabilized
>> > > > > > >> somewhat over the last week and I'm planning a broader release
>> > > > > > >> soon, but I'd still consider it alpha. That said, if anyone's
>> > > > > > >> willing to give it a test drive, you just need to run:
>> > > > > > >>
>> > > > > > >> Pkg.add("Libz")
>> > > > > > >> Pkg.add("NullableArrays")
>> > > > > > >> Pkg.clone("https://github.com/quinnj/DataStreams.jl")
>> > > > > > >> Pkg.clone("https://github.com/quinnj/CSV.jl")
>> > > > > > >>
>> > > > > > >> With the original file referenced here I get:
>> > > > > > >>
>> > > > > > >> julia> reload("CSV")
>> > > > > > >>
>> > > > > > >> julia> f = CSV.Source("/Users/jacobquinn/Downloads/train.csv";null="NA")
>> > > > > > >> CSV.Source: "/Users/jacobquinn/Downloads/train.csv"
>> > > > > > >> delim: ','
>> > > > > > >> quotechar: '"'
>> > > > > > >> escapechar: '\\'
>> > > > > > >> null: "NA"
>> > > > > > >> schema:
>> > > > > > >> DataStreams.Schema(UTF8String["ID","VAR_0001","VAR_0002","VAR_0003","VAR_0004","VAR_0005","VAR_0006","VAR_0007","VAR_0008","VAR_0009"
>> > > > > > >> …
>> > > > > > >> "VAR_1926","VAR_1927","VAR_1928","VAR_1929","VAR_1930","VAR_1931","VAR_1932","VAR_1933","VAR_1934","target"],[Int64,DataStreams.PointerString,Int64,Int64,Int64,DataStreams.PointerString,Int64,Int64,DataStreams.PointerString,DataStreams.PointerString
>> > > > > > >> …
>> > > > > > >> Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,DataStreams.PointerString,Int64],145231,1934)
>> > > > > > >> dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> julia> @time ds = DataStreams.DataTable(f)
>> > > > > > >>  43.513800 seconds (694.00 M allocations: 12.775 GB, 2.55% gc time)
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> You can convert the result to a DataFrame with:
>> > > > > > >>
>> > > > > > >> function DataFrames.DataFrame(dt::DataStreams.DataTable)
>> > > > > > >>     cols = dt.schema.cols
>> > > > > > >>     data = Array(Any,cols)
>> > > > > > >>     types = DataStreams.types(dt)
>> > > > > > >>     for i = 1:cols
>> > > > > > >>         data[i] = DataStreams.column(dt,i,types[i])
>> > > > > > >>     end
>> > > > > > >>     return DataFrame(data,Symbol[symbol(x) for x in dt.schema.header])
>> > > > > > >> end
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> -Jacob
>> > > > > > >>
>> > > > > > >> On Tue, Oct 13, 2015 at 2:40 PM, feza wrote:
>> > > > > > >>>
>> > > > > > >>> Finally was able to load it, but the process consumes a ton of memory.
>> > > > > > >>> julia> @time train = readtable("./test.csv");
>> > > > > > >>> 124.575362 seconds (376.11 M allocations: 13.438 GB, 10.77% gc time)
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> On Tuesday, October 13, 2015 at 4:34:05 PM UTC-4, feza wrote:
>> > > > > > >>>>
>> > > > > > >>>> Same here on a 12gb ram machine
>> > > > > > >>>>
>> > > > > > >>>>                _
>> > > > > > >>>>    _       _ _(_)_     |  A fresh approach to technical computing
>> > > > > > >>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>> > > > > > >>>>    _ _   _| |_  __ _   |  Type "?help" for help.
>> > > > > > >>>>   | | | | | | |/ _` |  |
>> > > > > > >>>>   | | |_| | | | (_| |  |  Version 0.5.0-dev+429 (2015-09-29 09:47 UTC)
>> > > > > > >>>>  _/ |\__'_|_|_|\__'_|  |  Commit f71e449 (14 days old master)
>> > > > > > >>>> |__/                   |  x86_64-w64-mingw32
>> > > > > > >>>>
>> > > > > > >>>> julia> using DataFrames
>> > > > > > >>>>
>> > > > > > >>>> julia> train = readtable("./test.csv");
>> > > > > > >>>> ERROR: OutOfMemoryError()
>> > > > > > >>>>  in resize! at array.jl:452
>> > > > > > >>>>  in readnrows! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:164
>> > > > > > >>>>  in readtable! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:767
>> > > > > > >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:847
>> > > > > > >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:893
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>> On Tuesday, October 13, 2015 at 3:47:58 PM UTC-4, Yichao Yu wrote:
>> > > > > > >>>>>
>> > > > > > >>>>>
>> > > > > > >>>>> On Oct 13, 2015 2:47 PM, "Grey Marsh" wrote:
>> > > > > > >>>>>
>> > > > > > >>>>> Which Julia version are you using? There's some GC tweak on 0.4 for
>> > > > > > >>>>> that.
>> > > > > > >>>>>
>> > > > > > >>>>> >
>> > > > > > >>>>> > I was trying to load the training dataset from the Springleaf
>> > > > > > >>>>> > Marketing Response competition on Kaggle. The CSV is 921 MB and
>> > > > > > >>>>> > has 145321 rows and 1934 columns. My machine has 8 GB of RAM, and
>> > > > > > >>>>> > Julia ate 5.8 GB+ of memory, after which I stopped Julia as there
>> > > > > > >>>>> > was barely any memory left for the OS to function properly. The
>> > > > > > >>>>> > incomplete operation had been running for about 5-6 minutes by
>> > > > > > >>>>> > then. I'm on Windows 8 64-bit.
>> > > > > > >>>>> > I used the following code to read the CSV into Julia:
>> > > > > > >>>>> >
>> > > > > > >>>>> > using DataFrames
>> > > > > > >>>>> > train = readtable("C:\\train.csv")
>> > > > > > >>>>> >
>> > > > > > >>>>> > Next I tried to load the same file in Python:
>> > > > > > >>>>> >
>> > > > > > >>>>> > import pandas as pd
>> > > > > > >>>>> > train = pd.read_csv("C:\\train.csv")
>> > > > > > >>>>> >
>> > > > > > >>>>> > This took ~2.4 GB of memory and about a minute.
>> > > > > > >>>>> >
>> > > > > > >>>>> > Checking the same in R again:
>> > > > > > >>>>> > df = read.csv('E:/Libraries/train.csv', as.is = T)
>> > > > > > >>>>> >
>> > > > > > >>>>> > This took 2-3 minutes and consumed 3.5 GB of memory on the same
>> > > > > > >>>>> > machine.
>> > > > > > >>>>> >
>> > > > > > >>>>> > Why such a discrepancy, and why does Julia run out of memory
>> > > > > > >>>>> > before it can even load the CSV? Is there a better way to get the
>> > > > > > >>>>> > file loaded in Julia?
>> > > > > > >>>>> >
>> > > > > > >>>>> >
>> > > > > > >>
>> > > > > > >>
>> > > > > > >
>> > > > > >
>> > > > >
>>
>
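
A note on the @time figures quoted in this thread: as Yichao and Milan point
out, the "12.775 GB" is the sum of every allocation made during the call, not
the amount of memory held at any one moment. The sketch below illustrates that
distinction in Python (one of the languages compared in the thread) rather
than in Julia 0.4. The function name churn and the buffer sizes are made up
for illustration, and tracemalloc only tracks allocations made through
Python's allocator, so this is an analogy for watching RES in top, not a
measurement of it.

```python
import tracemalloc


def churn(chunks=200, chunk_bytes=1_000_000):
    """Allocate many short-lived buffers and report two different totals.

    Returns (total_allocated, peak_traced), both in bytes:
      total_allocated -- sum of the sizes of every buffer ever created
      peak_traced     -- largest amount of traced memory alive at one time
    """
    tracemalloc.start()
    total_allocated = 0
    for _ in range(chunks):
        # Rebinding `buf` drops the previous buffer, so only about one or
        # two buffers are alive at any moment.
        buf = bytearray(chunk_bytes)
        total_allocated += len(buf)
    _, peak_traced = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total_allocated, peak_traced


if __name__ == "__main__":
    total, peak = churn()
    print(f"total allocated: {total / 1e6:.0f} MB")
    print(f"peak live:       {peak / 1e6:.1f} MB")
```

The cumulative total grows with the number of temporary buffers, while the
peak stays near the size of the few buffers that are alive together. The
peak is the figure that decides whether a machine with 8 GB of RAM can
complete the load.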
