From: EXT Bill Fischofer [mailto:[email protected]]
Sent: Monday, October 19, 2015 6:23 PM
To: Ola Liljedahl
Cc: Savolainen, Petri (Nokia - FI/Espoo); LNG ODP Mailman List
Subject: Re: [lng-odp] Bug 1851 - odp_pool_destroy() failure

On Mon, Oct 19, 2015 at 10:07 AM, Ola Liljedahl <[email protected]> wrote:

On 19 October 2015 at 16:43, Bill Fischofer <[email protected]> wrote:

We could do that in linux-generic, which has a fairly small number of threads supported. I'd be concerned about how that would scale to systems that can support many more threads, especially when NUMA considerations come into play. Is it simply unacceptable to have some sort of "finished" API call? That would seem to solve the problem in a clean and scalable manner.

A memory barrier is different from an execution barrier. Alloc/free/destroy should scale well also when the thread-local stash is stored in shared memory, regardless of how many threads there are or whether a multi-chip interconnect is involved. Alloc and free work in a per-thread slice of the stash (== no synchronization needed); only destroy needs to go and read the entire stash after the application threads have stopped using it (== no synchronization needed).

A finish call would complicate the API. The user would need to:
1) ensure that the application does not use the pool any more (which can be done in many ways)
2) schedule the finish() call to be executed on all threads
3) wait and synchronize to notice when all threads have called finish() (what if one of the threads exited before calling it?)
4) call destroy on one thread

Currently we have only steps 1 and 4.

Isn't this conceptually similar to the stop scheduling call so that I can drain the prescheduling queue and then stop participating in event processing? In order to allow for "non-ideal" implementations (because instant sharing of all resources isn't always very performant), we create APIs that tell ODP that this thread wishes to withdraw from processing using shared resources.

It is not. Schedule_pause() signals that this thread is now stepping out of the scheduling loop, so that any prescheduled events can be processed and those flows will not deadlock if the thread does not return.
Only the application processes the stash of pre-scheduled events. The implementation can move pre-allocated buffers back to the global pool at thread termination, without the application's help.

I think that's a useful analogy. We've recently added stop/start APIs to pktio for similar reasons, and of course we have odp_schedule_pause() that serves the same advisory function. We don't need a "start" API for pools (though if you wanted one for symmetry I don't see any harm there) but you really do want a "stop" API.

Pktio start/stop are also different. Those control event input from an external source (the network), not potential per-thread stashing. There's no requirement for each thread to call pktio stop when the application wants to stop incoming packets from the network.

-Petri

On Mon, Oct 19, 2015 at 9:15 AM, Savolainen, Petri (Nokia - FI/Espoo) <[email protected]> wrote:

A SW implementation can place the per-thread stash into shared memory, where the thread calling destroy() can see the stashes of all other threads. Since the application must synchronize the destroy call (to happen only after all free() calls have returned), the implementation must just ensure that the destroy call reads fresh stash status data (== it has correct memory read/write barriers in place). Performance should still be good: it is a matter of moving the per-thread stash from TLS to shared memory (no additional synchronization per alloc/free).

-Petri

From: EXT Bill Fischofer [mailto:[email protected]]
Sent: Monday, October 19, 2015 2:26 PM
To: Savolainen, Petri (Nokia - FI/Espoo)
Cc: LNG ODP Mailman List
Subject: Re: [lng-odp] Bug 1851 - odp_pool_destroy() failure

This is an important discussion, especially as we look to high-performance SW implementations of ODP. Obviously we can stipulate any functional behavior we want. The question is how much overhead is acceptable to achieve such stipulated functionality?
One of the reasons DPDK does not support mempool destroys is this issue of distributed cache management. If we don't want the application to take any responsibility in this area, then the implementation needs to impose additional bookkeeping overhead that will likely impact the performance of normal operation.

What's needed is some sort of indication that a thread is not just freeing a buffer, but is done with operations on a pool. One way of doing this is to add an odp_pool_finished() API that tells the implementation that this thread is done with the pool (e.g., asserts that no further alloc() calls will be made by this thread on it). My suggestion in the response to the bug was that odp_pool_destroy() can serve this purpose; however, I'd have no problem with adding another API that serves the same notification purpose.

Without such an API, it's not clear how we can achieve the desired functionality without a lot of additional overhead or removing any sort of safety checks. If the latter is acceptable, we could say that odp_pool_destroy() always succeeds, and if the application has any outstanding buffers or tries to use the pool handle following a destroy() call then the result is undefined.

On Mon, Oct 19, 2015 at 5:48 AM, Savolainen, Petri (Nokia - FI/Espoo) <[email protected]> wrote:

Hi,

The linux-generic pool implementation has a bug ( https://bugs.linaro.org/show_bug.cgi?id=1851 ) that prevents dynamic pool destroy.

From the API point of view, any resource (e.g. a pool) is created once (the xxx_create call returns a handle) and destroyed once (pass the handle to xxx_destroy). Any thread can create a resource and any thread can destroy it. Application threads must synchronize resource usage and the destroy call, but not implementation specifics such as the potential use of per-thread stashes or the flushing of those.

For example, this valid usage of the pool API:

Thread 1              Thread 2              Thread 3
--------------------------------------------------
init_global()
init_local()          init_local()          init_local()
pool = pool_create()
barrier()             barrier()             barrier()
buf = alloc(pool)     buf = alloc(pool)     buf = alloc(pool)
free(buf)             free(buf)             free(buf)
barrier()             barrier()             barrier()
pool_destroy(pool)
barrier()             barrier()             barrier()
do_something()        do_something()        do_something()
term_local()          term_local()          term_local()
term_global()

So, e.g., pool_destroy must succeed when all buffers have been freed before the call, no matter:
* which thread calls it
* whether the calling thread itself has called alloc or free
* whether other threads have already called term_local

-Petri

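The shared-memory stash idea discussed in this thread can be illustrated with a minimal C sketch. This is not the linux-generic implementation and it uses no ODP calls: all names (pool_t, stash_t, pool_alloc(), pool_free(), pool_destroy()) and the size limits are hypothetical, and the thread index is passed explicitly rather than looked up. The assumption is that the application has already ordered every free() before the destroy call (for example with a barrier), so destroy only needs to read every thread's stash slot.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical limits, chosen only for illustration. */
#define MAX_THREADS 4
#define STASH_MAX   4
#define NUM_BUFS    16

/* Per-thread stash. It lives inside the shared pool, not in TLS. */
typedef struct {
    int count;
    int buf[STASH_MAX];
} stash_t;

typedef struct {
    atomic_flag lock;            /* guards the global free list only */
    int free_count;
    int free_buf[NUM_BUFS];
    stash_t stash[MAX_THREADS];  /* one slot per thread, visible to all */
} pool_t;

static void pool_lock(pool_t *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire)) {
        /* spin */
    }
}

static void pool_unlock(pool_t *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

static void pool_init(pool_t *p)
{
    atomic_flag_clear(&p->lock);
    p->free_count = NUM_BUFS;
    for (int i = 0; i < NUM_BUFS; i++)
        p->free_buf[i] = i;
    for (int t = 0; t < MAX_THREADS; t++)
        p->stash[t].count = 0;
}

/* Returns a buffer index, or -1 if the pool is empty.
 * The fast path touches only the caller's own stash slot. */
static int pool_alloc(pool_t *p, int thr)
{
    stash_t *s = &p->stash[thr];
    int b = -1;

    if (s->count > 0)
        return s->buf[--s->count];

    pool_lock(p);
    if (p->free_count > 0)
        b = p->free_buf[--p->free_count];
    pool_unlock(p);
    return b;
}

/* Frees into the caller's own stash; spills to the global list when full. */
static void pool_free(pool_t *p, int thr, int b)
{
    stash_t *s = &p->stash[thr];

    if (s->count < STASH_MAX) {
        s->buf[s->count++] = b;
        return;
    }
    pool_lock(p);
    assert(p->free_count < NUM_BUFS);
    p->free_buf[p->free_count++] = b;
    pool_unlock(p);
}

/* Returns 0 if every buffer is accounted for, -1 if some are outstanding.
 * Any thread may call this. The fence stands in for whatever read barrier
 * a real implementation would need; the caller's own synchronization is
 * assumed to order all earlier free() calls before this call. */
static int pool_destroy(pool_t *p)
{
    int total;

    atomic_thread_fence(memory_order_acquire);
    total = p->free_count;
    for (int t = 0; t < MAX_THREADS; t++)
        total += p->stash[t].count;
    return total == NUM_BUFS ? 0 : -1;
}
```

In this sketch the thread that calls pool_destroy() does not need to have called alloc or free itself, and no other thread has to flush its stash first, which matches the requirement in the example above. The cost is that stash slots sit in shared memory instead of TLS; alloc and free still take no lock on the fast path.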
_______________________________________________ lng-odp mailing list [email protected] https://lists.linaro.org/mailman/listinfo/lng-odp
