Hi Bruce,

Replies are inline.
Thanks,
Cheng

> -----Original Message-----
> From: Richardson, Bruce <bruce.richard...@intel.com>
> Sent: Monday, January 30, 2023 5:20 PM
> To: Jiang, Cheng1 <cheng1.ji...@intel.com>
> Cc: tho...@monjalon.net; m...@smartsharesystems.com; dev@dpdk.org;
> Hu, Jiayu <jiayu...@intel.com>; Ding, Xuan <xuan.d...@intel.com>;
> Ma, WenwuX <wenwux...@intel.com>; Wang, YuanX <yuanx.w...@intel.com>;
> He, Xingguang <xingguang...@intel.com>
> Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
>
> On Sat, Jan 28, 2023 at 01:32:05PM +0000, Jiang, Cheng1 wrote:
> > Hi Bruce,
> >
> > Sorry for the late reply. We were on the Spring Festival holiday last week.
> > Thanks for your comments.
> > Replies are inline.
> >
> > Thanks,
> > Cheng
> >
> > > -----Original Message-----
> > > From: Richardson, Bruce <bruce.richard...@intel.com>
> > > Sent: Wednesday, January 18, 2023 12:52 AM
> > > To: Jiang, Cheng1 <cheng1.ji...@intel.com>
> > > Cc: tho...@monjalon.net; m...@smartsharesystems.com; dev@dpdk.org;
> > > Hu, Jiayu <jiayu...@intel.com>; Ding, Xuan <xuan.d...@intel.com>;
> > > Ma, WenwuX <wenwux...@intel.com>; Wang, YuanX <yuanx.w...@intel.com>;
> > > He, Xingguang <xingguang...@intel.com>
> > > Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
> > >
> > > On Tue, Jan 17, 2023 at 12:05:26PM +0000, Cheng Jiang wrote:
> > > > There are many high-performance DMA devices supported in DPDK now,
> > > > and these DMA devices can also be integrated into other modules of
> > > > DPDK as accelerators, such as Vhost. Before integrating DMA into
> > > > applications, developers need to know the performance of these DMA
> > > > devices in various scenarios and the performance of CPUs in the
> > > > same scenario, such as different buffer lengths. Only in this way
> > > > can we know the target performance of the application accelerated
> > > > by using them. This patch introduces a high-performance testing
> > > > tool, which supports comparing the performance of CPU and DMA in
> > > > different scenarios automatically with a pre-set config file.
> > > > Memory copy performance tests are supported for now.
> > > >
> > > > Signed-off-by: Cheng Jiang <cheng1.ji...@intel.com>
> > > > Signed-off-by: Jiayu Hu <jiayu...@intel.com>
> > > > Signed-off-by: Yuan Wang <yuanx.w...@intel.com>
> > > > Acked-by: Morten Brørup <m...@smartsharesystems.com>
> > > > ---
> > >
> > > More input based on trying out the application, including some
> > > thoughts on the testing methodology, below.
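For context, the CPU side of the comparison described in the commit message boils down to an average cycle count per memcpy() of buf_size bytes. A minimal sketch of that kind of baseline measurement (illustrative only, not the patch's code; the helper name and plain pointer arrays are assumptions) is:

#include <stdint.h>
#include <string.h>
#include <rte_cycles.h>

/*
 * Illustrative sketch: CPU baseline as average cycles per buffer copy.
 * nr_buf and buf_size would come from the test config file.
 */
static uint64_t
cpu_copy_cycles_per_buf(void **srcs, void **dsts, uint32_t nr_buf, uint32_t buf_size)
{
    uint64_t start = rte_rdtsc();

    for (uint32_t i = 0; i < nr_buf; i++)
        memcpy(dsts[i], srcs[i], buf_size);

    /* average cycles spent per buffer copied */
    return (rte_rdtsc() - start) / nr_buf;
}

The DMA side of the comparison then replaces the memcpy() calls with dmadev enqueue/submit/poll calls, which is what the do_dma_mem_copy() function quoted below does.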
> > > >
> > > > +static void
> > > > +output_result(uint8_t scenario_id, uint32_t lcore_id, uint16_t dev_id, uint64_t ave_cycle,
> > > > +            uint32_t buf_size, uint32_t nr_buf, uint32_t memory,
> > > > +            float bandwidth, uint64_t ops, bool is_dma)
> > > > +{
> > > > +    if (is_dma)
> > > > +        printf("lcore %u, DMA %u:\n"
> > > > +                "average cycles: %" PRIu64 ","
> > > > +                " buffer size: %u, nr_buf: %u,"
> > > > +                " memory: %uMB, frequency: %" PRIu64 ".\n",
> > > > +                lcore_id,
> > > > +                dev_id,
> > > > +                ave_cycle,
> > > > +                buf_size,
> > > > +                nr_buf,
> > > > +                memory,
> > > > +                rte_get_timer_hz());
> > > > +    else
> > > > +        printf("lcore %u\n"
> > > > +                "average cycles: %" PRIu64 ","
> > > > +                " buffer size: %u, nr_buf: %u,"
> > > > +                " memory: %uMB, frequency: %" PRIu64 ".\n",
> > > > +                lcore_id,
> > > > +                ave_cycle,
> > > > +                buf_size,
> > > > +                nr_buf,
> > > > +                memory,
> > > > +                rte_get_timer_hz());
> > > > +
> > >
> > > The term "average cycles" is unclear here - is it average cycles per
> > > test iteration, or average cycles per buffer copy?
> >
> > The average cycles stands for average cycles per buffer copy. I'll
> > clarify it in the next version.
> >
> > >
> > > > +    printf("Average bandwidth: %.3lfGbps, OPS: %" PRIu64 "\n", bandwidth, ops);
> > > > +
> > >
> > > <snip>
> > >
> > > > +
> > > > +static inline void
> > > > +do_dma_mem_copy(uint16_t dev_id, uint32_t nr_buf, uint16_t kick_batch, uint32_t buf_size,
> > > > +            uint16_t mpool_iter_step, struct rte_mbuf **srcs, struct rte_mbuf **dsts)
> > > > +{
> > > > +    int64_t async_cnt = 0;
> > > > +    int nr_cpl = 0;
> > > > +    uint32_t index;
> > > > +    uint16_t offset;
> > > > +    uint32_t i;
> > > > +
> > > > +    for (offset = 0; offset < mpool_iter_step; offset++) {
> > > > +        for (i = 0; index = i * mpool_iter_step + offset, index < nr_buf; i++) {
> > > > +            if (unlikely(rte_dma_copy(dev_id,
> > > > +                        0,
> > > > +                        srcs[index]->buf_iova + srcs[index]->data_off,
> > > > +                        dsts[index]->buf_iova + dsts[index]->data_off,
> > > > +                        buf_size,
> > > > +                        0) < 0)) {
> > > > +                rte_dma_submit(dev_id, 0);
> > > > +                while (rte_dma_burst_capacity(dev_id, 0) == 0) {
> > > > +                    nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB,
> > > > +                                NULL, NULL);
> > > > +                    async_cnt -= nr_cpl;
> > > > +                }
> > > > +                if (rte_dma_copy(dev_id,
> > > > +                        0,
> > > > +                        srcs[index]->buf_iova + srcs[index]->data_off,
> > > > +                        dsts[index]->buf_iova + dsts[index]->data_off,
> > > > +                        buf_size,
> > > > +                        0) < 0) {
> > > > +                    printf("enqueue fail again at %u\n", index);
> > > > +                    printf("space:%d\n", rte_dma_burst_capacity(dev_id, 0));
> > > > +                    rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> > > > +                }
> > > > +            }
> > > > +            async_cnt++;
> > > > +
> > > > +            /**
> > > > +             * When '&' is used to wrap an index, mask must be a power of 2.
> > > > +             * That is, kick_batch must be 2^n.
> > > > +             */
> > > > +            if (unlikely((async_cnt % kick_batch) == 0)) {
> > > > +                rte_dma_submit(dev_id, 0);
> > > > +                /* add a poll to avoid ring full */
> > > > +                nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > > > +                async_cnt -= nr_cpl;
> > > > +            }
> > > > +        }
> > > > +
> > > > +        rte_dma_submit(dev_id, 0);
> > > > +        while (async_cnt > 0) {
> > > > +            nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > > > +            async_cnt -= nr_cpl;
> > > > +        }
> > >
> > > I have a couple of concerns about the methodology for testing the HW
> > > DMA performance. For example, the inclusion of that final block
> > > means that we are including the latency of the copy operation in the
> > > result.
> > >
> > > If the objective of the test application is to determine if it is
> > > cheaper for software to offload a copy operation to HW or do it in
> > > SW, then the primary concern is the HW offload cost. That offload
> > > cost should remain constant irrespective of the size of the copy -
> > > since all you are doing is writing a descriptor and reading a
> > > completion result. However, seeing the results of running the app, I
> > > notice that the reported average cycles increases as the packet size
> > > increases, which would tend to indicate that we are not giving a
> > > realistic measurement of offload cost.
> >
> > We are trying to compare the time required to complete a certain
> > amount of work using DMA with the time required to complete it using
> > CPU. I think in addition to the offload cost, the capability of the
> > DMA itself is also an important factor to be considered.
> > The offload cost should be constant, but when DMA copies memory of
> > different lengths, the time costs are different. So the reported
> > average cycles increases as the packet size increases.
> > Therefore, this test result includes both offload cost and DMA
> > operation cost. To some extent, it should be a relatively realistic
> > measurement result.
> >
> > Does that make sense to you?
> >
>
> Hi,
>
> Yes, I get your point about the job latency being different when the
> packet/copy sizes increase, but on the other hand, as I state above the
> actual cycle cost to the application should not increase. If an
> application is doing what this test app is doing, just sitting around
> waiting for job completion (in the fast path), then it is likely that
> the programmer should look at improving the offload into the app.
>
> The main issue here is that by outputting a single number, you are
> mixing two separate values - both offload cost and job latency. If you
> want to show the effects of larger/smaller packets on both, then you
> should output both values separately. For most applications where you
> will offload copies and do other work while the copy is being done, the
> offload cost is of primary concern. For some applications the latency
> figure may also be important, but in those cases the user will want to
> see the latency called out explicitly, not just mixed up in a single
> figure with offload cost.

Sure, makes sense to me, thanks.
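As a rough illustration of reporting the two costs separately (sketch only, not the patch's code; the helper name, the single vchan 0, the poll burst of 64 and the fixed delay are assumptions), the enqueue/doorbell path and the completion path could be timed independently:

#include <stdint.h>
#include <rte_cycles.h>
#include <rte_dmadev.h>

/*
 * Illustrative sketch: time descriptor writes plus the doorbell (offload
 * cost) separately from completion polling, instead of folding both into
 * one combined "average cycles" figure.
 */
static void
measure_offload_cost(int16_t dev_id, uint32_t nr_buf, uint32_t buf_size,
        rte_iova_t *src_iova, rte_iova_t *dst_iova,
        uint64_t *enq_cycles, uint64_t *cpl_cycles)
{
    uint32_t enqueued = 0, completed = 0;
    uint64_t start;

    /* 1. offload cost: write descriptors and ring the doorbell.
     * For simplicity this sketch just stops if the ring fills. */
    start = rte_rdtsc();
    while (enqueued < nr_buf &&
            rte_dma_copy(dev_id, 0, src_iova[enqueued],
                dst_iova[enqueued], buf_size, 0) >= 0)
        enqueued++;
    rte_dma_submit(dev_id, 0);
    *enq_cycles = rte_rdtsc() - start;

    /* a real app would do useful work here while the copies land */
    rte_delay_us(100);

    /* 2. completion cost: poll and consume the completion records */
    start = rte_rdtsc();
    while (completed < enqueued)
        completed += rte_dma_completed(dev_id, 0, 64, NULL, NULL);
    *cpl_cycles = rte_rdtsc() - start;
}

Dividing each figure by the number of jobs gives a per-copy offload cost and a per-copy completion cost, and, per the reasoning above, the offload part should stay roughly flat as the copy size grows.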
> > >
> > > The trouble then becomes how to do so in a more realistic manner.
> > > The most accurate way I can think of in a unit test like this is to
> > > offload <queue_size> entries to the device and measure the cycles
> > > taken there. Then wait until such time as all copies are completed
> > > (to eliminate the latency time, which in a real-world case would be
> > > spent by a core doing something else), and then do a second
> > > measurement of the time taken to process all the completions. In the
> > > same way as for a SW copy, any time not spent in memcpy is not copy
> > > time; for HW copies, any time spent not writing descriptors or
> > > reading completions is not part of the offload cost.
> >
> > Agreed, we are thinking about adding offload cost as one of the test
> > results in the future.
> >
> > >
> > > That said, doing the above is still not fully realistic, as a
> > > real-world app will likely still have some amount of other overhead,
> > > for example, polling occasionally for completions in between doing
> > > other work (though one would expect this to be relatively cheap).
> > > Similarly, if the submission queue fills, the app may have to delay
> > > waiting for space to submit jobs, and therefore see some of the HW
> > > copy latency.
> > >
> > > Therefore, I think the most realistic way to measure this is to look
> > > at the rate of operations while processing is being done in the
> > > middle of the test. For example, if we have a simple packet
> > > processing application, running the application just doing RX and TX
> > > and measuring the rate allows us to determine the basic packet I/O
> > > cost. Adding in an offload to HW for each packet and again measuring
> > > the rate will then allow us to compute the true offload copy cost
> > > of the operation, and should give us a number that remains flat even
> > > as packet size increases. For previous work done on vhost with DMA
> > > acceleration, I believe we saw exactly that - while SW PPS reduced as
> > > packet size increased, with HW copies the PPS remained constant even
> > > as packet size increased.
> > >
> > > The challenge, to my mind, is therefore how to implement this in a
> > > suitable unit-test style way, to fit into the framework you have
> > > given here. I would suggest that the actual performance measurement
> > > needs to be done - not on a total time - but on a fixed time basis
> > > within each test. For example, when doing HW copies, 1ms into each
> > > test run, we need to snapshot the completed entries, and then say
> > > 1ms later measure the number that have been completed since. In this
> > > way, we avoid the initial startup latency while we wait for jobs to
> > > start completing, and we avoid the final latency as we await the last
> > > job to complete. We would also include time for some potentially
> > > empty polls, and if a queue size is too small see that reflected in
> > > the performance too.
> >
> > I understand your concerns, but I think maybe we are not discussing
> > the same performance number here.
> > We are trying to test the maximum bandwidth of DMA, and what you said
> > is how to measure the offload cost more accurately, if I understand it
> > correctly.
> > I think these two performance data points are both important. Maybe we
> > can add your test methodology as one of the performance aspects for
> > DMA in the future. I need to reconsider it and get back to you later.
> >
>
> Max bandwidth of HW is a third and separate number from that of
> offload-cost and latency. Again, it should be measured and reported
> separately if you want the app to provide it.

OK, got it. We will try to implement such a test method in the next version.

Thanks,
Cheng

>
> Regards,
> /Bruce
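A rough sketch along the lines of the fixed-time-window idea above (illustrative only, not the patch's or DPDK's code; the helper name, single vchan 0, a single reused source/destination buffer, per-op submits and the poll burst of 64 are assumptions):

#include <stdint.h>
#include <rte_cycles.h>
#include <rte_dmadev.h>

/*
 * Illustrative sketch: count completions over a fixed time window taken
 * in the middle of the run (e.g. starting ~1ms in), so start-up and tail
 * latency are excluded and empty polls count against the measured rate.
 * A real test would rotate buffers and batch submits (kick_batch).
 */
static double
sample_copy_rate(int16_t dev_id, uint32_t buf_size, rte_iova_t src, rte_iova_t dst,
        uint64_t window_us)
{
    uint64_t hz = rte_get_timer_hz();
    uint64_t window_cycles = window_us * hz / 1000000;
    uint64_t completed = 0;
    uint64_t start = rte_rdtsc();

    while (rte_rdtsc() - start < window_cycles) {
        /* keep the queue fed; if it is full, just poll more */
        if (rte_dma_copy(dev_id, 0, src, dst, buf_size, 0) >= 0)
            rte_dma_submit(dev_id, 0);
        completed += rte_dma_completed(dev_id, 0, 64, NULL, NULL);
    }

    /* copies completed per second during the window */
    return (double)completed * hz / window_cycles;
}

Because the window is fixed, empty polls and a too-small ring both show up directly as a lower completed-per-second figure, which matches the behaviour described above.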