Thanks,
I've tried the new code and it seems to have shaved about 1GB of memory
off, so the heap is at about 8.84GB now. Here is the updated pprof output:
https://i.imgur.com/itOHqBf.png

It looks like the majority of allocations are in memory.GoAllocator:
(pprof) top
Showing nodes accounting for 8.84GB, 100% of 8.84GB total
Showing top 10 nodes out of 41
      flat  flat%   sum%        cum   cum%
    4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
    2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
    1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
    0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
    0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
    0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
    0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
    0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
         0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
         0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
I'm a bit busy at the moment but I'll probably repeat the same test on the
other Arrow implementations (e.g. Java) to see if they allocate a similar
amount.
Daniel Harper
http://djhworld.github.io
On Mon, 19 Nov 2018 at 10:17, Sebastien Binet <[email protected]> wrote:
> hi Daniel,
> On Sun, Nov 18, 2018 at 10:17 PM Daniel Harper <[email protected]> wrote:
>
> > Sorry, just realised the SVG doesn't work.
> >
> > PNG of the pprof can be found here: https://i.imgur.com/BVXv1Jm.png
> >
> >
> > Daniel Harper
> > http://djhworld.github.io
> >
> >
> > On Sun, 18 Nov 2018 at 21:07, Daniel Harper <[email protected]> wrote:
> >
> > > I wasn't sure of the best place to discuss this, but I've noticed that
> > > when running the following piece of code
> > >
> > > https://play.golang.org/p/SKkqPWoHPPS
> > >
> > > on a CSV file that contains roughly 1 million records (about 100MB of
> > > data), the memory usage of the process leaps to about 9.1GB.
> > >
> > > The records look something like this:
> > >
> > > "2018-08-27T20:00:00Z","cdnA","dash","audio","http","programme-1","3577","2018","08","27","2018-08-27","live"
> > > "2018-08-27T20:00:01Z","cdnB","hls","video","https","programme-2","14","2018","08","27","2018-08-27","ondemand"
> > > I've attached a pprof output of the process.
> > >
> > > From the looks of it, the heavy use of _strings_ might be where most of
> > > the memory is going.
> > >
> > > Is this expected? I'm new to the code, happy to help where possible!
> >
>
> it's somewhat expected.
>
> you use `ioutil.ReadFile` to get your data.
> this will read the whole file into memory and keep it there: so there's
> that.
> for much bigger files, I would recommend using `os.Open` and streaming
> instead.
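>
> here's a rough sketch of what I mean (untested, with a made-up file name
> and only the first two of your twelve columns filled in):
>
>     package main
>
>     import (
>         "log"
>         "os"
>
>         "github.com/apache/arrow/go/arrow"
>         "github.com/apache/arrow/go/arrow/csv"
>     )
>
>     func main() {
>         f, err := os.Open("records.csv") // hypothetical path
>         if err != nil {
>             log.Fatal(err)
>         }
>         defer f.Close()
>
>         // truncated schema: only two of the columns shown
>         schema := arrow.NewSchema([]arrow.Field{
>             {Name: "timestamp", Type: arrow.BinaryTypes.String},
>             {Name: "cdn", Type: arrow.BinaryTypes.String},
>         }, nil)
>
>         r := csv.NewReader(f, schema) // arrow/csv, not encoding/csv
>         defer r.Release()
>
>         for r.Next() {
>             rec := r.Record() // owned by the reader; Retain() to keep it
>             _ = rec           // process the record here
>         }
>         if err := r.Err(); err != nil {
>             log.Fatal(err)
>         }
>     }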
>
> also, you don't release the individual records once passed to the table, so
> you have a memory leak.
> here is my current attempt:
> - https://play.golang.org/p/ns3GJW6Wx3T
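>
> the gist of the fix, as a fragment (building on a reader `r` and `schema`
> like the ones above; NewTableFromRecords may differ in the version you're
> on):
>
>     var recs []array.Record
>     for r.Next() {
>         rec := r.Record()
>         rec.Retain() // keep it alive past the next call to r.Next()
>         recs = append(recs, rec)
>     }
>
>     tbl := array.NewTableFromRecords(schema, recs)
>     defer tbl.Release()
>
>     // the table retains the data it needs; drop our own references
>     for _, rec := range recs {
>         rec.Release()
>     }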
>
> finally, as I was alluding to on the #data-science slack channel, right now
> Go arrow/csv will create a new Record for each row in the incoming CSV
> file.
> so you get a bunch of overhead for every row/record.
>
> a much more efficient way would be to chunk `n` rows into a single Record.
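>
> roughly, with a RecordBuilder (a fragment; `csvRows` stands in for rows
> coming out of encoding/csv, and only two columns are shown):
>
>     bldr := array.NewRecordBuilder(memory.DefaultAllocator, schema)
>     defer bldr.Release()
>
>     const n = 4096 // rows per chunk; tune to taste
>     var recs []array.Record
>
>     rows := 0
>     for _, row := range csvRows {
>         bldr.Field(0).(*array.StringBuilder).Append(row[0])
>         bldr.Field(1).(*array.StringBuilder).Append(row[1])
>         // ... one Append per column
>         rows++
>         if rows == n {
>             recs = append(recs, bldr.NewRecord()) // resets the builder
>             rows = 0
>         }
>     }
>     if rows > 0 {
>         recs = append(recs, bldr.NewRecord()) // flush the tail
>     }
>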
> an even more efficient way would be to create a dedicated csv.table type
> that implements array.Table (as it seems you're interested in using that
> interface) but only reads the incoming CSV file piecewise (i.e. implementing
> the chunking I was alluding to above, but w/o having to load the whole
> []Record slice).
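>
> a bare skeleton of that type (method bodies stubbed out; this is only
> meant to show the array.Table surface to implement, which may have moved
> since):
>
>     type csvTable struct {
>         schema *arrow.Schema
>         // csv reader state, current chunk, reference count, ...
>     }
>
>     func (t *csvTable) Schema() *arrow.Schema      { return t.schema }
>     func (t *csvTable) NumRows() int64             { return 0 /* count lazily */ }
>     func (t *csvTable) NumCols() int64             { return int64(len(t.schema.Fields())) }
>     func (t *csvTable) Column(i int) *array.Column { return nil /* read piecewise */ }
>     func (t *csvTable) Retain()                    {}
>     func (t *csvTable) Release()                   {}
>
>     var _ array.Table = (*csvTable)(nil) // compile-time interface check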
>
> as a first step to improve this issue, implementing chunking would already
> shave off a bunch of overhead.
>
> -s
>