Petri and I have already given review comments on the PR, so thank you for
that. Let's continue the motivation / need discussion here.

On Fri, Dec 8, 2017 at 6:43 AM, Michal Mazur <michal.ma...@linaro.org>
wrote:

> Created a PR: https://github.com/Linaro/odp/pull/331
>
> If we want to use VPP+ODP on ARM, it has to be optimized anyway. This API
> is not specific to x86 or to VPP.
>

Actually, it is specific to certain types of ODP implementations. In the
general case, where an odp_packet_t is simply a HW token, this API offers no
obvious performance advantage. The general solution to the VPP problem is to
store the odp_packet_t in the user area itself, where it can simply be
retrieved as needed. The fact that VPP on x86 has a cache problem is an
application-specific problem that should be solved in the application, not by
inventing new ODP APIs. As Petri noted in his PR review, when odp_packet_t
ownership is surrendered, so is ownership of any user area address, so these
addresses may not be cached and assumed to remain valid. The identity (and
validity) of a user area address is tied to the underlying odp_packet_t
handle.
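
To make that concrete, here is a minimal sketch of what I mean. The struct
layout and helper names below are purely illustrative (only
odp_packet_user_area() is an actual ODP call), and the plugin would size the
user area to hold both the handle and the VLIB data:

#include <odp_api.h>

/* Illustrative layout: the plugin reserves the first bytes of the user
 * area for the handle itself, with the VLIB buffer data following it. */
typedef struct {
    odp_packet_t pkt;    /* stashed by the plugin at RX time */
    /* ... VLIB buffer data follows ... */
} plugin_uarea_t;

static inline void plugin_stash_handle(odp_packet_t pkt)
{
    plugin_uarea_t *ua = odp_packet_user_area(pkt);

    ua->pkt = pkt;    /* one store per packet on the RX path */
}

static inline odp_packet_t plugin_handle_from_uarea(void *uarea)
{
    return ((plugin_uarea_t *)uarea)->pkt;    /* one load on the TX path */
}

That stays entirely within existing APIs; the cost is the per-packet load
Michal has measured, which is the trade-off being debated here.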


> It will provide a faster (optimized for each implementation) method to get
> the ODP handle, which is required by every application using ODP.
> What applications other than VPP are using ODP now? How do they solve
> this issue?
>
> On 8 December 2017 at 05:36, Bill Fischofer <bill.fischo...@linaro.org>
> wrote:
>
>>
>>
>> On Thu, Dec 7, 2017 at 10:12 PM, Honnappa Nagarahalli <
>> honnappa.nagaraha...@linaro.org> wrote:
>>
>>> On 7 December 2017 at 17:36, Bill Fischofer <bill.fischo...@linaro.org>
>>> wrote:
>>> >
>>> >
>>> > On Thu, Dec 7, 2017 at 3:17 PM, Honnappa Nagarahalli
>>> > <honnappa.nagaraha...@linaro.org> wrote:
>>> >>
>>> >> This experiment clearly shows the need for providing an API in ODP.
>>> >>
>>> >> On ODP2.0 implementations such an API will be simple enough (constant
>>> >> subtraction), requiring no additional storage in VLIB.
>>> >>
>>> >> Michal, can you send a PR to ODP for the API so that we can debate the
>>> >> feasibility of the API for Cavium/NXP platforms.
>>> >
>>> >
>>> > That's the point. An API that is tailored to a specific implementation
>>> or
>>> > application is not what ODP is about.
>>> >
>>> How are requirements currently coming into ODP APIs? My
>>> understanding is that they come from OFP and from Petri's requirements.
>>> Similarly, VPP is also an application of ODP. Recently, the Arm community
>>> (Arm and partners) prioritized the open source projects of greatest
>>> importance and came up with a top 50 (or 100) list. If I remember
>>> correctly, VPP is among the top single digits (I am trying to get the
>>> exact details). So it is an application of significant interest.
>>>
>>
>> VPP is important, but what matters is for VPP to perform
>> significantly better on at least one ODP implementation than it does today
>> using DPDK. If we can't demonstrate that, then there's no point to the
>> ODP4VPP project. That's not going to happen on x86, since we can assume
>> VPP/DPDK is already optimal there given how closely VPP has been tuned to
>> DPDK internals. So we need to focus the performance work on Arm SoC
>> platforms that offer significant HW acceleration capabilities that VPP can
>> exploit via ODP4VPP. This isn't one of those. The claim is that, with or
>> without this change, ODP4VPP on x86 performs worse than VPP/DPDK on x86.
>>
>> Since VPP applications don't change whether or not ODP4VPP is in the
>> picture, it doesn't matter whether it's used on x86, so tuning ODP4VPP on
>> x86 is at best of secondary importance. We just need at least one Arm
>> platform on which VPP applications run dramatically better with it than
>> without it.
>>
>>
>>>
>>> >>
>>> >>
>>> >> On 7 December 2017 at 14:08, Bill Fischofer <
>>> bill.fischo...@linaro.org>
>>> >> wrote:
>>> >> > On Thu, Dec 7, 2017 at 12:22 PM, Michal Mazur <
>>> michal.ma...@linaro.org>
>>> >> > wrote:
>>> >> >
>>> >> >> The native VPP+DPDK plugin knows the size of the rte_mbuf header
>>> >> >> and subtracts it from the vlib buffer pointer.
>>> >> >>
>>> >> >> struct rte_mbuf *mb0 = rte_mbuf_from_vlib_buffer (b0);
>>> >> >> #define rte_mbuf_from_vlib_buffer(x) (((struct rte_mbuf *)x) - 1)
>>> >> >>
>>> >> >
>>> >> > No surprise that VPP is a DPDK application, but I thought they
>>> >> > wanted to be independent of DPDK. The problem is that ODP is never
>>> >> > going to match DPDK at an ABI level on x86, so we can't be fixated on
>>> >> > x86 performance comparisons between ODP4VPP and VPP/DPDK.
>>> >> Any reason why we will not be able to match or exceed the performance?
>>> >
>>> >
>>> > It's not that ODP can't have good performance on x86; it's that DPDK
>>> > encourages apps to be very dependent on DPDK implementation details,
>>> > such as the one seen here. ODP is not going to match DPDK internals, so
>>> > applications that exploit such internals will always see a difference.
>>> >
>>> >>
>>> >>
>>> >> > What we need to do is compare ODP4VPP on Arm-based SoCs vs. "native
>>> >> > VPP" that can't take advantage of the HW acceleration present on
>>> >> > those platforms. That's how we get to show dramatic differences. If
>>> >> > ODP4VPP is only within a few percent (plus or minus) of VPP/DPDK,
>>> >> > there's no point in doing the project at all.
>>> >> >
>>> >> > So my advice would be to stash the handle in the VLIB buffer for now
>>> >> > and focus on exploiting the native IPsec acceleration capabilities
>>> >> > that ODP will permit.
>>> >> >
>>> >> >
>>> >> >> On 7 December 2017 at 19:02, Bill Fischofer <
>>> bill.fischo...@linaro.org>
>>> >> >> wrote:
>>> >> >>
>>> >> >>> Ping to others on the mailing list for opinions on this. What does
>>> >> >>> "native" VPP+DPDK get and how is this problem solved there?
>>> >> >>>
>>> >> >>> On Thu, Dec 7, 2017 at 11:55 AM, Michal Mazur
>>> >> >>> <michal.ma...@linaro.org>
>>> >> >>> wrote:
>>> >> >>>
>>> >> >>>> The _odp_packet_inline structure is common to all packets and
>>> >> >>>> takes up to two cachelines (it contains only offsets). Reading the
>>> >> >>>> pointer for each packet from VLIB would require fetching 10
>>> >> >>>> million cachelines per second. Using prefetches does not help.
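
(For anyone following the internals here: the table Michal refers to is a
single process-global struct of offsets, so the handle conversion itself
touches no per-packet state. A rough, purely illustrative shape of that idea,
not the actual odp-dpdk definition:

/* One global offset table shared by all packets; recovering the handle
 * from a user area address is then a single constant subtraction. */
typedef struct {
    uint16_t udata;    /* offset from packet header start to user area */
    /* ... other offsets ... */
} pkt_inline_offsets_t;

extern const pkt_inline_offsets_t pkt_inline_offsets;

The real odp-dpdk counterpart is the _odp_packet_inline global referenced
elsewhere in this thread.)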
>>> >> >>>>
>>> >> >>>> On 7 December 2017 at 18:37, Bill Fischofer
>>> >> >>>> <bill.fischo...@linaro.org>
>>> >> >>>> wrote:
>>> >> >>>>
>>> >> >>>>> Yes, but _odp_packet_inline.udata is clearly not in the VLIB
>>> >> >>>>> cache line either, so it's a separate cache line access. Are you
>>> >> >>>>> seeing this difference in real runs or in microbenchmarks? Why
>>> >> >>>>> isn't the entire VLIB being prefetched at dispatch? Sequential
>>> >> >>>>> prefetching should add negligible overhead.
>>> >> >>>>>
>>> >> >>>>> On Thu, Dec 7, 2017 at 11:13 AM, Michal Mazur
>>> >> >>>>> <michal.ma...@linaro.org>
>>> >> >>>>> wrote:
>>> >> >>>>>
>>> >> >>>>>> It seems that only the first cache line of the VLIB buffer is in
>>> >> >>>>>> L1; the new pointer can be placed only in the second cacheline.
>>> >> >>>>>> Using a constant offset between the user area and the ODP header
>>> >> >>>>>> I get 11 Mpps, with the pointer stored in the VLIB buffer only
>>> >> >>>>>> 10 Mpps, and with this new API 10.6 Mpps.
>>> >> >>>>>>
>>> >> >>>>>> On 7 December 2017 at 18:04, Bill Fischofer
>>> >> >>>>>> <bill.fischo...@linaro.org
>>> >> >>>>>> > wrote:
>>> >> >>>>>>
>>> >> >>>>>>> How would calling an API be better than referencing the stored
>>> >> >>>>>>> data yourself? A cache line reference is a cache line reference,
>>> >> >>>>>>> and presumably the VLIB buffer is already in L1 since it's your
>>> >> >>>>>>> active data.
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Dec 7, 2017 at 10:45 AM, Michal Mazur <
>>> >> >>>>>>> michal.ma...@linaro.org> wrote:
>>> >> >>>>>>>
>>> >> >>>>>>>> Hi,
>>> >> >>>>>>>>
>>> >> >>>>>>>> For the odp4vpp plugin we need a new API function which, given
>>> >> >>>>>>>> a user area pointer, will return the corresponding ODP packet
>>> >> >>>>>>>> handle. It is needed when packets processed by VPP are sent
>>> >> >>>>>>>> back to ODP and only a pointer to the VLIB buffer data (stored
>>> >> >>>>>>>> inside the user area) is known.
>>> >> >>>>>>>>
>>> >> >>>>>>>> I have tried storing the ODP packet handle in the VLIB data,
>>> >> >>>>>>>> but reading it for every packet lowers performance by 800 kpps.
>>> >> >>>>>>>>
>>> >> >>>>>>>> For the odp-dpdk implementation it can look like:
>>> >> >>>>>>>>
>>> >> >>>>>>>> /** @internal Map a user area pointer back to its packet handle */
>>> >> >>>>>>>> static inline odp_packet_t _odp_packet_from_user_area(void *uarea)
>>> >> >>>>>>>> {
>>> >> >>>>>>>>        return (odp_packet_t)((uintptr_t)uarea -
>>> >> >>>>>>>>                              _odp_packet_inline.udata);
>>> >> >>>>>>>> }
>>> >> >>>>>>>>
>>> >> >>>>>>>> Please let me know what you think.
>>> >> >>>>>>>>
>>> >> >>>>>>>> Thanks,
>>> >> >>>>>>>> Michal
>>> >> >>>>>>>>
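
For reference, the intended use on the plugin's TX path would presumably look
something like the sketch below; vlib_buffer_t and odp_pktout_send() are real,
while the helper and its assumption (the vlib_buffer_t sitting at the start of
the ODP packet user area) are illustrative:

#include <odp_api.h>
#include <vlib/vlib.h>

/* Hypothetical TX helper: b0 is a VPP buffer whose memory lives at the
 * start of an ODP packet user area; pktout is an opened pktout queue. */
static inline int plugin_tx_one(odp_pktout_queue_t pktout, vlib_buffer_t *b0)
{
    /* Recover the owning packet handle from the user area address. */
    odp_packet_t pkt = _odp_packet_from_user_area((void *)b0);

    return odp_pktout_send(pktout, &pkt, 1);
}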
>>> >> >>>>>>>
>>> >> >>>>>>>
>>> >> >>>>>>
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>
>>> >> >>
>>> >
>>> >
>>>
>>
>>
>
