Petri and I have already given review comments on the PR, so thank you for that. Let's continue the motivation / need discussion here.
On Fri, Dec 8, 2017 at 6:43 AM, Michal Mazur <michal.ma...@linaro.org> wrote:

> Created a PR: https://github.com/Linaro/odp/pull/331
>
> If we want to use VPP+ODP on ARM it has to be optimized anyway. This API
> is not specific to x86 nor to VPP.

Actually it is specific to certain types of ODP implementations. In the
general case, where an odp_packet_t is simply a HW token, there is no
obvious performance advantage offered by this API. The general solution to
the VPP problem is to store the odp_packet_t in the user area itself, where
it may simply be retrieved as needed (see the sketch at the end of this
message). The fact that VPP on x86 has a cache problem is an
application-specific problem that should be solved in the application, not
by inventing new ODP APIs. As Petri noted in his PR review, when
odp_packet_t ownership is surrendered, so is ownership of any user area
address, so these may not be cached and assumed to retain validity. The
identity (and validity) of a user area address is tied to the underlying
odp_packet_t handle.

> It will provide a faster (optimized for each implementation) method to get
> the ODP handle, which is required for every application using ODP.
> What applications other than VPP are using ODP now? How do they solve
> this issue?
>
> On 8 December 2017 at 05:36, Bill Fischofer <bill.fischo...@linaro.org>
> wrote:
>
>> On Thu, Dec 7, 2017 at 10:12 PM, Honnappa Nagarahalli
>> <honnappa.nagaraha...@linaro.org> wrote:
>>
>>> On 7 December 2017 at 17:36, Bill Fischofer <bill.fischo...@linaro.org>
>>> wrote:
>>>
>>> > On Thu, Dec 7, 2017 at 3:17 PM, Honnappa Nagarahalli
>>> > <honnappa.nagaraha...@linaro.org> wrote:
>>> >
>>> >> This experiment clearly shows the need for providing an API in ODP.
>>> >>
>>> >> On ODP 2.0 implementations such an API will be simple enough (a
>>> >> constant subtraction), requiring no additional storage in VLIB.
>>> >>
>>> >> Michal, can you send a PR to ODP for the API so that we can debate
>>> >> the feasibility of the API for Cavium/NXP platforms?
>>> >
>>> > That's the point. An API that is tailored to a specific implementation
>>> > or application is not what ODP is about.
>>>
>>> How are the requirements coming to ODP APIs currently? My understanding
>>> is that they come from OFP and Petri's requirements. Similarly, VPP is
>>> also an application of ODP. Recently, the Arm community (Arm and
>>> partners) prioritized the open source projects that are of importance
>>> and came up with a top 50 (or 100) list. If I remember correctly, VPP is
>>> among the top single digits (I am trying to get the exact details). So,
>>> it is an application of significant interest.
>>
>> VPP is important, but what's important is for VPP to perform
>> significantly better on at least one ODP implementation than it does
>> today using DPDK. If we can't demonstrate that then there's no point to
>> the ODP4VPP project. That's not going to happen on x86, since we can
>> assume that VPP/DPDK is optimal there because VPP has been tuned to DPDK
>> internals. So we need to focus the performance work on Arm SoC platforms
>> that offer significant HW acceleration capabilities that VPP can exploit
>> via ODP4VPP. This isn't one of those. The claim is that, with or without
>> this change, ODP4VPP on x86 performs worse than VPP/DPDK on x86.
>>
>> Since VPP applications don't change whether ODP4VPP is in the picture or
>> not, it doesn't matter whether it's used on x86, so tuning ODP4VPP on x86
>> is at best of secondary importance.
>> We just need at least one Arm platform on which VPP applications run
>> dramatically better than without it.
>>
>>> >> On 7 December 2017 at 14:08, Bill Fischofer
>>> >> <bill.fischo...@linaro.org> wrote:
>>> >>
>>> >> > On Thu, Dec 7, 2017 at 12:22 PM, Michal Mazur
>>> >> > <michal.ma...@linaro.org> wrote:
>>> >> >
>>> >> >> Native VPP+DPDK plugin knows the size of the rte_mbuf header and
>>> >> >> subtracts it from the vlib pointer:
>>> >> >>
>>> >> >> struct rte_mbuf *mb0 = rte_mbuf_from_vlib_buffer (b0);
>>> >> >> #define rte_mbuf_from_vlib_buffer(x) (((struct rte_mbuf *)x) - 1)
>>> >> >
>>> >> > No surprise that VPP is a DPDK application, but I thought they
>>> >> > wanted to be independent of DPDK. The problem is that ODP is never
>>> >> > going to match DPDK at an ABI level on x86, so we can't be fixated
>>> >> > on x86 performance comparisons between ODP4VPP and VPP/DPDK.
>>> >>
>>> >> Any reason why we will not be able to match or exceed the
>>> >> performance?
>>> >
>>> > It's not that ODP can't have good performance on x86, it's that DPDK
>>> > encourages apps to be very dependent on DPDK implementation details
>>> > such as seen here. ODP is not going to match DPDK internals, so
>>> > applications that exploit such internals will always see a difference.
>>>
>>> >> > What we need to do is compare ODP4VPP on Arm-based SoCs vs. "native
>>> >> > VPP" that can't take advantage of the HW acceleration present on
>>> >> > those platforms. That's how we get to show dramatic differences. If
>>> >> > ODP4VPP is only within a few percent (plus or minus) of VPP/DPDK
>>> >> > there's no point in doing the project at all.
>>> >> >
>>> >> > So my advice would be to stash the handle in the VLIB buffer for
>>> >> > now and focus on exploiting the native IPsec acceleration
>>> >> > capabilities that ODP will permit.
>>> >> >
>>> >> >> On 7 December 2017 at 19:02, Bill Fischofer
>>> >> >> <bill.fischo...@linaro.org> wrote:
>>> >> >>
>>> >> >>> Ping to others on the mailing list for opinions on this. What
>>> >> >>> does "native" VPP+DPDK get, and how is this problem solved there?
>>> >> >>>
>>> >> >>> On Thu, Dec 7, 2017 at 11:55 AM, Michal Mazur
>>> >> >>> <michal.ma...@linaro.org> wrote:
>>> >> >>>
>>> >> >>>> The _odp_packet_inline is common for all packets and takes up to
>>> >> >>>> two cachelines (it contains only offsets). Reading the pointer
>>> >> >>>> for each packet from VLIB would require fetching 10 million
>>> >> >>>> cachelines per second. Using prefetches does not help.
>>> >> >>>>
>>> >> >>>> On 7 December 2017 at 18:37, Bill Fischofer
>>> >> >>>> <bill.fischo...@linaro.org> wrote:
>>> >> >>>>
>>> >> >>>>> Yes, but _odp_packet_inline.udata is clearly not in the VLIB
>>> >> >>>>> cache line either, so it's a separate cache line access. Are
>>> >> >>>>> you seeing this difference in real runs or microbenchmarks? Why
>>> >> >>>>> isn't the entire VLIB being prefetched at dispatch? Sequential
>>> >> >>>>> prefetching should add negligible overhead.
>>> >> >>>>>
>>> >> >>>>> On Thu, Dec 7, 2017 at 11:13 AM, Michal Mazur
>>> >> >>>>> <michal.ma...@linaro.org> wrote:
>>> >> >>>>>
>>> >> >>>>>> It seems that only the first cache line of the VLIB buffer is
>>> >> >>>>>> in L1; a new pointer can be placed only in the second
>>> >> >>>>>> cacheline.
>>> >> >>>>>> Using a constant offset between the user area and the ODP
>>> >> >>>>>> header I get 11 Mpps, with the pointer stored in the VLIB
>>> >> >>>>>> buffer only 10 Mpps, and with this new API 10.6 Mpps.
>>> >> >>>>>>
>>> >> >>>>>> On 7 December 2017 at 18:04, Bill Fischofer
>>> >> >>>>>> <bill.fischo...@linaro.org> wrote:
>>> >> >>>>>>
>>> >> >>>>>>> How would calling an API be better than referencing the
>>> >> >>>>>>> stored data yourself? A cache line reference is a cache line
>>> >> >>>>>>> reference, and presumably the VLIB buffer is already in L1
>>> >> >>>>>>> since it's your active data.
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Dec 7, 2017 at 10:45 AM, Michal Mazur
>>> >> >>>>>>> <michal.ma...@linaro.org> wrote:
>>> >> >>>>>>>
>>> >> >>>>>>>> Hi,
>>> >> >>>>>>>>
>>> >> >>>>>>>> For the odp4vpp plugin we need a new API function which,
>>> >> >>>>>>>> given a user area pointer, will return a pointer to the ODP
>>> >> >>>>>>>> packet buffer. It is needed when packets processed by VPP
>>> >> >>>>>>>> are sent back to ODP and only a pointer to the VLIB buffer
>>> >> >>>>>>>> data (stored inside the user area) is known.
>>> >> >>>>>>>>
>>> >> >>>>>>>> I have tried to store the ODP buffer pointer in the VLIB
>>> >> >>>>>>>> data, but reading it for every packet lowers performance by
>>> >> >>>>>>>> 800 kpps.
>>> >> >>>>>>>>
>>> >> >>>>>>>> For the odp-dpdk implementation it can look like:
>>> >> >>>>>>>>
>>> >> >>>>>>>> /** @internal Return the packet handle owning a user area
>>> >> >>>>>>>>  *  @param uarea  User area pointer
>>> >> >>>>>>>>  *  @return       Corresponding odp_packet_t
>>> >> >>>>>>>>  */
>>> >> >>>>>>>> static inline odp_packet_t _odp_packet_from_user_area(void *uarea)
>>> >> >>>>>>>> {
>>> >> >>>>>>>>     return (odp_packet_t)((uintptr_t)uarea -
>>> >> >>>>>>>>                           _odp_packet_inline.udata);
>>> >> >>>>>>>> }
>>> >> >>>>>>>>
>>> >> >>>>>>>> Please let me know what you think.
>>> >> >>>>>>>>
>>> >> >>>>>>>> Thanks,
>>> >> >>>>>>>> Michal
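
P.S. For concreteness, here is a minimal sketch of the "store the handle in
the user area" alternative I referred to above. It uses only the public
odp_packet_user_area() API; the user_area_hdr_t layout (handle first,
application data such as the VLIB buffer after it) and the helper names are
assumptions for illustration, not anything odp4vpp currently defines.

    #include <odp_api.h>

    /* Assumed user area layout: the owning handle first, application data
     * (e.g. the VLIB buffer) immediately after it. The pool must be created
     * with a uarea_size large enough to hold both. */
    typedef struct {
        odp_packet_t handle;    /* stashed once when the packet is received */
        /* application data starts here */
    } user_area_hdr_t;

    /* Stash the handle, e.g. right after odp_pktin_recv(). */
    static inline void user_area_stash_handle(odp_packet_t pkt)
    {
        user_area_hdr_t *ua = (user_area_hdr_t *)odp_packet_user_area(pkt);

        ua->handle = pkt;
    }

    /* Recover the handle later, when only the user area pointer is known. */
    static inline odp_packet_t packet_from_user_area(void *uarea)
    {
        return ((user_area_hdr_t *)uarea)->handle;
    }

Compared with the proposed _odp_packet_from_user_area(), this trades one
extra per-packet store and load for not needing a new ODP API. Michal's
numbers above for the pointer-stored-in-VLIB case indicate roughly what that
extra read costs on x86; whether an Arm SoC target shows the same
sensitivity is the question that matters for ODP4VPP.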