A sink doesn't make allocations so I wouldn't expect to see any there.
Sources may make allocations, depending on the source.  For
example, a "scan" source will make allocations.  However, a scan node
currently gets its memory pool from the scan options (this is somewhat
unfortunate, and I'd like to improve the scan node to get the memory
pool from the exec plan).  So, for example, if you made a "scan" /
"sink" plan it would be possible to see nothing get logged.

[1] is a simple program that runs a scan/sink plan but it also sets
the memory pool on the scan options.  When I run this application I do
get logged output about allocations.  If I don't assign the pool to
the scan options then I get nothing.

[1] https://gist.github.com/westonpace/7babc845db265cfedc3d789b8d199d1e

On Tue, Jul 12, 2022 at 9:10 AM Ivan Chau <ivan.c...@twosigma.com> wrote:
>
> Would this also explain the lack of allocations, reallocations or frees when 
> creating a pipeline with just a source and a sink?
>
> For example, we do not see logs for a regular source node, a table source 
> node, or a streaming file reader (using RecordBatchFileReader and 
> MakeReaderGenerator to generate batches for a regular source node).
>
> -----Original Message-----
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Monday, July 11, 2022 4:37 PM
> To: dev@arrow.apache.org
> Subject: Re: cpp Memory Pool Clarification
>
> > Is there anything else I'd need to change?
>
> Maybe try something like this:
> https://github.com/westonpace/arrow/commit/15ac0d051136c585cda63297e48f17557808d898
>
> > Beyond that, we should also expect to see some allocations from 
> > TableSourceNode going through the logging memory pool, even if AsOfJoinNode 
> > was using the default memory pool instead of the Exec Plan's pool, but I am 
> > not seeing anything come through...
>
> TableSourceNode wouldn't need to allocate since it runs against memory that's 
> already been allocated.  It might split input into smaller batches but 
> slicing tables / arrays is a zero-copy operation that does not require 
> allocating new buffers.
>
> On Mon, Jul 11, 2022 at 12:46 PM Ivan Chau <ivan.c...@twosigma.com> wrote:
> >
> > Yeah this behavior is certainly a bit strange then.
> >
> > The only alteration I am making is changing the way we create the Execution 
> > Context in the benchmark file.
> >
> > Something like:
> >
> > ```
> > auto logging_pool = LoggingMemoryPool(default_memory_pool());
> > ExecContext ctx(&logging_pool, ...);
> > ```
> >
> > Is there anything else I'd need to change?
> >
> > Beyond that, we should also expect to see some allocations from 
> > TableSourceNode going through the logging memory pool, even if AsOfJoinNode 
> > was using the default memory pool instead of the Exec Plan's pool, but I am 
> > not seeing anything come through...
> >
> > -----Original Message-----
> > From: Weston Pace <weston.p...@gmail.com>
> > Sent: Monday, July 11, 2022 2:47 PM
> > To: dev@arrow.apache.org
> > Subject: Re: cpp Memory Pool Clarification
> >
> > Are you changing the default memory pool to a LoggingMemoryPool?
> > Where are you doing this?  For a benchmark I think you would need to change 
> > the implementation in the benchmark file itself.
> >
> > Similarly, is AsofJoinNode using the default memory pool or the memory pool 
> > of the exec plan?  It should be exclusively using the latter, but it's easy 
> > to accidentally fall back to the default memory pool.  It probably won't 
> > make too much of a difference at the end of the day, as benchmarks normally 
> > configure an exec plan to use the default memory pool, in which case the two 
> > pools are the same.
> >
> > > My expectation is that we would see some pretty sizable calls to Allocate 
> > > when we begin to read files or to create tables, but that is not evident.
> >
> > Yes, the materialization step of an asof join uses array builders, and those 
> > will be allocating buffers from a memory pool.
> >
> > > 1) To my understanding, only large allocations will call Allocate.
> > > Are there allocations (for files, table objects), which despite
> > > being of large size, do not call Allocate?
> >
> > No.  There is no size threshold for the allocator.  Instead, when people were 
> > talking about "large allocations" and "small allocations" in the previous 
> > thread, it was more of a general concept.
> >
> > For example, if I create an array builder, add some items to it, and then 
> > create an array then this will always use a memory pool for the allocation. 
> >  This will be true even if I create an array with a single element in it 
> > (in which case the allocation is often padded for alignment purposes).
> >
> > On the other hand, schemas keep their fields in a std::vector which never 
> > uses the memory pool for allocation.  This is true even if I have 10,000 
> > columns and the vector's memory is actually quite large.
> >
> > However, in general, arrays tend to be quite large and schemas tend to be 
> > quite small.
> >
> > > 2) How can maximum_peak_memory be nonzero if we have not seen any
> > > calls to Allocate/Reallocate/Free?
> >
> > I don't think that is possible.
> >
> > On Mon, Jul 11, 2022 at 10:44 AM Ivan Chau <ivan.m.c...@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > > I've been doing some testing with LoggingMemoryPool to benchmark our
> > > AsOfJoin implementation
> > > <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/asof_join_node.cc>.
> > > Our underlying memory pool for the LoggingMemoryPool is the
> > > default_memory_pool (this is process-wide).
> > >
> > > Curiously enough, I don't see any allocations, reallocations, or
> > > frees when we run our benchmarking code. I also see that the
> > > max_memory property of the memory pool (which is documented as the
> > > peak memory allocation) is nonzero (1.2e9 bytes).
> > >
> > > My expectation is that we would see some pretty sizable calls to
> > > Allocate when we begin to read files or to create tables, but that is not 
> > > evident.
> > >
> > > 1) To my understanding, only large allocations will call Allocate.
> > > Are there allocations (for files, table objects), which despite
> > > being of large size, do not call Allocate?
> > >
> > > 2) How can maximum_peak_memory be nonzero if we have not seen any
> > > calls to Allocate/Reallocate/Free?
> > >
> > > Thank you!
