Please enjoy your vacation!

I was able to track it down to the allocations performed when creating
futures.  Simply changing from Executor.Spawn to Executor.Submit
accounts for nearly half of the performance difference.  Pooled
allocators may be worth investigating someday.  At the very least, we
should be able to easily reuse instances of FutureImpl,
FutureStorage<void>, and FutureStorage<Status>, roughly along the lines
of the sketch below.
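
To sketch the kind of reuse I have in mind (a generic freelist with
hypothetical names -- the real change would live inside future.h and
would need to handle lifetime questions this glosses over):

#include <memory>
#include <mutex>
#include <vector>

// Generic freelist: Acquire() hands out a recycled instance when one
// is available, Release() stashes it for reuse instead of freeing it.
// Assumes T is default-constructible and safe to reuse after a reset.
template <typename T>
class StoragePool {
 public:
  std::unique_ptr<T> Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_list_.empty()) return std::make_unique<T>();
    std::unique_ptr<T> storage = std::move(free_list_.back());
    free_list_.pop_back();
    return storage;
  }

  void Release(std::unique_ptr<T> storage) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_list_.push_back(std::move(storage));
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<T>> free_list_;
};

// e.g. one process-wide pool per storage type:
//   StoragePool<FutureStorage<Status>> status_storage_pool;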

However, for the moment, there is no need for that degree of
performance.  I'll leave futures out of the existing TaskGroup methods
and move ahead to working on the reader.

On Tue, Oct 27, 2020 at 4:48 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Le 27/10/2020 à 15:47, Antoine Pitrou a écrit :
> >
> > Hi Weston,
> >
> > Note: I'm on vacation, so won't be able to look at this before ~2 weeks.
> >
> > For information, there's a micro-benchmark of thread pools and task
> > groups in src/arrow/util/thread_pool_benchmark.cc.  It should allow you
> > to isolate performance concerns a bit better.
>
> Oops, sorry, just saw that you're already using it.  That's what I get
> for reading too quickly.
>
> Regards
>
> Antoine.
>
>
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 26/10/2020 à 16:48, Weston Pace a écrit :
> >> Hi all,
> >>
> >> I've completed the initial composable futures API and iterator work.
> >> The CSV reader portion is still WIP.
> >>
> >> First, I'm interested in getting any feedback on the futures API.  In
> >> particular, Future<T>.Then in future.h (and the type-erased
> >> Composable.Compose).  The actual implementation can probably be
> >> cleaned up with regard to DRY (the 10 specializations of the Continue
> >> function), which I plan to do at the end.
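
To illustrate the shape I mean here (rough intent only; CountRows is a
made-up helper and the authoritative signatures are the ones in
future.h):

// Hypothetical usage: Then attaches a continuation to a future and
// yields a new future for the continuation's result.
Future<int> CountRows(std::shared_ptr<Block> block);  // made-up helper

Future<std::string> summary =
    CountRows(block).Then([](const int& num_rows) {
      return std::string("rows=") + std::to_string(num_rows);
    });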
> >>
> >> This approach is a little different from my earlier prototype.  The
> >> prototype would always submit continuations to the thread pool as new
> >> tasks.  Instead, I've changed it so that continuations run
> >> synchronously when the future is marked complete.  If there is a
> >> desire to move the continuation into a thread pool task, it can be
> >> done with Executor.Transfer.  As an example, this is done in
> >> AsyncForEachHelper so that the applied for-each function does not run
> >> on the reader thread.
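
In other words (made-up Block/ProcessBlock helpers, and Transfer's
exact signature may differ from this sketch):

// Without a transfer, the continuation runs synchronously on whatever
// thread marks read_future complete (here, the reader thread):
read_future.Then([](const Block& block) { ProcessBlock(block); });

// With a transfer, the continuation is re-queued as a thread pool
// task, keeping ProcessBlock off the reader thread:
executor->Transfer(read_future)
    .Then([](const Block& block) { ProcessBlock(block); });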
> >>
> >> Second, and perhaps what I'm more interested in, I've switched
> >> ThreadedTaskGroup to using futures (e.g. using Compose to add a
> >> callback that calls OneTaskDone instead of making a wrapper lambda
> >> function).  In theory this should be more or less the exact same work
> >> as the previous task group implementation.  However, I am seeing
> >> noticeable overhead in arrow-thread-pool-benchmark for small tasks.
> >> The benchmark runs on my system at ~950k items/s with no task group,
> >> ~890k items/s with the old task group implementation, and ~450k
> >> items/s with the futures-based implementation.  The change is isolated
> >> to one method in task_group.cc, so if you replace the method at line
> >> 102 with the commented-out version at line 127 the original
> >> performance returns.  I've verified that the task is not getting
> >> copied.  There are a few extra moves and function calls, and futures
> >> have to be created and copied around, so it is possible that is the
> >> cause, but I'm curious whether a second pair of eyes can spot some
> >> other cause for the degradation that I am missing.  I'll also be
> >> seeing if I can get gprof running later, in the hope that it can
> >> provide some insight.  However, I probably won't spend too much more
> >> time on it before finishing up the CSV reader work and checking the
> >> performance of the CSV reader.
> >>
> >> If I can't figure out the cause of the performance degradation, I can
> >> always let the task group keep the implementation it has for
> >> Append(task) while using futures for Append(future).  I suspect the
> >> CSV reader tasks are long enough that the overhead won't be an issue.
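
Concretely, the fallback would look something like this (hypothetical
sketch, not the actual task_group.cc code):

// Plain tasks keep the original cheap path: a wrapper lambda reports
// completion and no future is allocated at all.
void ThreadedTaskGroup::Append(std::function<Status()> task) {
  executor_->Spawn([this, task = std::move(task)]() {
    OneTaskDone(task());
  });
}

// Only callers that already hold a future pay the composition cost.
void ThreadedTaskGroup::Append(Future<Status> future) {
  future.Then([this](const Status& status) { OneTaskDone(status); });
}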
> >>
> >> Code: 
> >> https://github.com/apache/arrow/compare/master...westonpace:feature/arrow-10183?expand=1
> >>
> >> -Weston
> >>
