I suppose it depends on your goal.

My earlier feedback was that doing a true scan is often detrimental
for benchmarking since I/O time can dominate the results.  Also, to
get the best scan results, you often spend a lot of time
micromanaging the file format / compression / file layout / etc.  That
was why I had recommended going with a TableSourceNode if you were
building a benchmark focused on understanding a single node.

On the other hand, if your goal is understanding end-to-end query
times, then a table source node is probably not what you would start
with.

One useful number, regardless of how you are inputting your data, is
the "total size of all data".  You wouldn't get that from a memory
pool though.  You could get it by calling the utilities in
src/arrow/util/byte_size.h on your table.  That would give you a
baseline to compare/contrast an individual node's allocations against.
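
For example, something like this (a rough sketch; double-check the
exact helper names and counting semantics in byte_size.h):

```
#include <cstdint>
#include <iostream>
#include <memory>

#include "arrow/table.h"
#include "arrow/util/byte_size.h"

// Print the number of bytes referenced by a table's buffers.
// TotalBufferSize sums the full size of each buffer the table points
// at; see byte_size.h for how sliced or shared buffers are counted.
void ReportDataSize(const std::shared_ptr<arrow::Table>& table) {
  int64_t total_bytes = arrow::util::TotalBufferSize(*table);
  std::cout << "total size of all data: " << total_bytes << " bytes"
            << std::endl;
}
```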

On Mon, Jul 11, 2022 at 2:04 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> > TableSourceNode wouldn't need to allocate since it runs against memory
> > that's already been allocated.
> Is the memory "that is already allocated" tracked in any allocators?  For an
> end-to-end benchmark of "scan - join - write" I think it would make sense to
> include all Arrow memory allocations (if that makes sense)
>
> On Mon, Jul 11, 2022 at 4:37 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > > Is there anything else I'd need to change?
> >
> > Maybe try something like this:
> >
> > https://github.com/westonpace/arrow/commit/15ac0d051136c585cda63297e48f17557808d898
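> >
> > The general shape of the change (just a sketch with made-up names,
> > not the literal contents of that commit):
> >
> > ```
> > #include "arrow/compute/exec.h"
> > #include "arrow/memory_pool.h"
> >
> > // Keep the logging pool alive for the whole benchmark run; the
> > // ExecContext only stores a pointer to it.
> > static arrow::LoggingMemoryPool logging_pool(arrow::default_memory_pool());
> >
> > arrow::compute::ExecContext MakeLoggingContext() {
> >   return arrow::compute::ExecContext(&logging_pool);
> > }
> > ```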
> >
> > > Beyond that, we should also expect to see some allocations from
> > > TableSourceNode going through the logging memory pool, even if AsOfJoinNode
> > > was using the default memory pool instead of the Exec Plan's pool, but I am
> > > not seeing anything come through...
> >
> > TableSourceNode wouldn't need to allocate since it runs against memory
> > that's already been allocated.  It might split input into smaller
> > batches but slicing tables / arrays is a zero-copy operation that does
> > not require allocating new buffers.
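> >
> > For example, something like this never touches a memory pool
> > (Table::Slice just shares the parent's buffers):
> >
> > ```
> > #include <cstdint>
> > #include <memory>
> >
> > #include "arrow/table.h"
> >
> > // Zero-copy: the slice points at the same buffers as the input
> > // table, so no new buffers are allocated from any memory pool.
> > std::shared_ptr<arrow::Table> FirstRows(
> >     const std::shared_ptr<arrow::Table>& table, int64_t n) {
> >   return table->Slice(0, n);
> > }
> > ```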
> >
> > On Mon, Jul 11, 2022 at 12:46 PM Ivan Chau <ivan.c...@twosigma.com> wrote:
> > >
> > > Yeah this behavior is certainly a bit strange then.
> > >
> > > The only alteration I am making is changing the way we create the
> > > Execution Context in the benchmark file.
> > >
> > > Something like:
> > >
> > > ```
> > > auto logging_pool = LoggingMemoryPool(default_memory_pool());
> > > ExecContext ctx(&logging_pool, ...);
> > > ```
> > >
> > > Is there anything else I'd need to change?
> > >
> > > Beyond that, we should also expect to see some allocations from
> > > TableSourceNode going through the logging memory pool, even if AsOfJoinNode
> > > was using the default memory pool instead of the Exec Plan's pool, but I am
> > > not seeing anything come through...
> > >
> > > -----Original Message-----
> > > From: Weston Pace <weston.p...@gmail.com>
> > > Sent: Monday, July 11, 2022 2:47 PM
> > > To: dev@arrow.apache.org
> > > Subject: Re: cpp Memory Pool Clarification
> > >
> > > Are you changing the default memory pool to a LoggingMemoryPool?
> > > Where are you doing this?  For a benchmark I think you would need to
> > > change the implementation in the benchmark file itself.
> > >
> > > Similarly, is AsofJoinNode using the default memory pool or the memory
> > > pool of the exec plan?  It should be using the latter exclusively, but it's
> > > sometimes easy to overlook a stray use of the default memory pool.  It
> > > probably won't make too much of a difference at the end of the day, as
> > > benchmarks normally configure an exec plan to use the default memory pool,
> > > so the two pools would be the same.
> > >
> > > > My expectation is that we would see some pretty sizable calls to
> > > > Allocate when we begin to read files or to create tables, but that is not
> > > > evident.
> > >
> > > Yes, the materialization step of an asof join uses array builders and
> > > those will be allocating buffers from a memory pool.
> > >
> > > > 1) To my understanding, only large allocations will call Allocate. Are
> > > > there allocations (for files, table objects), which despite being of
> > > > large size, do not call Allocate?
> > >
> > > No.  There is no size limit for the allocator.  Instead, when people
> > > were talking about "large allocations" and "small allocations" in the
> > > previous thread it was more of a general concept.
> > >
> > > For example, if I create an array builder, add some items to it, and
> > > then create an array, this will always use a memory pool for the
> > > allocation.  This is true even if I create an array with a single
> > > element (in which case the allocation is often padded for alignment
> > > purposes).
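> > >
> > > A sketch of what I mean (the helper name here is made up):
> > >
> > > ```
> > > #include <memory>
> > >
> > > #include "arrow/array.h"
> > > #include "arrow/builder.h"
> > > #include "arrow/memory_pool.h"
> > > #include "arrow/result.h"
> > > #include "arrow/status.h"
> > >
> > > // Even a one-element array allocates from `pool`, and the
> > > // allocation is padded/aligned (typically to 64 bytes).
> > > arrow::Result<std::shared_ptr<arrow::Array>> OneElementArray(
> > >     arrow::MemoryPool* pool) {
> > >   arrow::Int64Builder builder(pool);
> > >   ARROW_RETURN_NOT_OK(builder.Append(42));
> > >   return builder.Finish();
> > > }
> > > ```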
> > >
> > > On the other hand, schemas keep their fields in a std::vector which
> > > never uses the memory pool for allocation.  This is true even if I have
> > > 10,000 columns and the vector's memory is actually quite large.
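> > >
> > > For instance, something like this never touches a MemoryPool, no
> > > matter how wide the schema is (again, just a sketch):
> > >
> > > ```
> > > #include <memory>
> > > #include <string>
> > > #include <utility>
> > >
> > > #include "arrow/type.h"
> > >
> > > // The fields live in a std::vector on the regular heap, so none
> > > // of this shows up in any memory pool's statistics.
> > > std::shared_ptr<arrow::Schema> WideSchema(int num_cols) {
> > >   arrow::FieldVector fields;
> > >   for (int i = 0; i < num_cols; ++i) {
> > >     fields.push_back(arrow::field("f" + std::to_string(i), arrow::int64()));
> > >   }
> > >   return arrow::schema(std::move(fields));
> > > }
> > > ```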
> > >
> > > However, in general, arrays tend to be quite large and schemas tend to
> > > be quite small.
> > >
> > > > 2) How can maximum_peak_memory be nonzero if we have not seen any
> > > > calls to Allocate/Reallocate/Free?
> > >
> > > I don't think that is possible.
> > >
> > > On Mon, Jul 11, 2022 at 10:44 AM Ivan Chau <ivan.m.c...@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I've been doing some testing with LoggingMemoryPool to benchmark our
> > > > AsOfJoin implementation
> > > > <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/asof_join_node.cc>.
> > > > Our underlying memory pool for the LoggingMemoryPool is the
> > > > default_memory_pool (this is process-wide).
> > > >
> > > > Curiously enough, I don't see any allocations, reallocations, or frees
> > > > when we run our benchmarking code. I also see that the max_memory
> > > > property of the memory pool (which is documented as the peak memory
> > > > allocation) is nonzero (1.2e9 bytes).
> > > >
> > > > My expectation is that we would see some pretty sizable calls to
> > > > Allocate when we begin to read files or to create tables, but that is
> > > > not evident.
> > > >
> > > > 1) To my understanding, only large allocations will call Allocate. Are
> > > > there allocations (for files, table objects), which despite being of
> > > > large size, do not call Allocate?
> > > >
> > > > 2) How can maximum_peak_memory be nonzero if we have not seen any
> > > > calls to Allocate/Reallocate/Free?
> > > >
> > > > Thank you!
> >
