On Sun, May 31, 2015 at 10:00 PM, Doug Ledford <[email protected]> wrote:
> On Sun, 2015-05-31 at 15:14 +0300, Or Gerlitz wrote:
>> Hi Doug,
>>
>> This patchset adds completion timestamping supports for verbs consumers.
>>
>> Reviewing the weekend threads, we've changed the flag time to reflect
>> that this is completion time-stamp and folded the mlx4 actual support
>> into one patch.
>>
>> Regarding the related user-space support, it's possible to add what you
>> were suggesting, ibv_get_raw_cqe_timestamp() -- returns ts in cycles and
>> ibv_get_cqe_timestamp() -- returns ts in ns, this makes the value returned
>> by the poll cq verb an opaque one that must go through one of  the 
>> convertors.
>>
>> We would to go for one helper ibv_get_timestamp(uint64_t raw_time, flag) 
>> which
>> could get the raw time-stamp and one of the following flags: RAW_TIME, 
>> RAW_NS_TIME.
>
> I'm theoretically OK with something similar to the above.  However, the
> NS time should not be raw.  It should be cooked and should be able to be
> valid to compare between different adapters.  Right now, the cycle
> counter that you are exposing is only useful for ordering between
> packets received on a single adapter where the cycle counter is the same
> on all packets.  Throw in a different vendor's card, or two of your own
> cards, and the issue gets much more complex.  The cooked value should be
> an actual, real time that can be used across these more complex
> environments.  Because of that, it really shouldn't be called RAW.
>

Thanks for the feedback Doug.
We wanted to add RAW_NS in order to free the user from calculating it by himself
(dividing the cycles value in the core_clock).
In addition to this, it's possible to implement a future NS_TIME
(without the "raw"), which
will convert the opaque time to system wide ns.

> So, if you want a single entry point, I would suggest something like
> this:
>
> enum ib_timestamp_flags {
>         IB_TIMESTAMP_COMPLETION = (1 << 0), // specify on create_cq
>         IB_TIMESTAMP_WQE_BEGIN =  (1 << 1), // specify on create qp?
>         IB_TIMESTAMP_WQE_END =    (1 << 2), // specify on create qp?
>         IB_TIMESTAMP_RAW =        (1 << 31)
> };
>
> enum ib_cq_creation_flags {
>         IB_CQ_FLAGS_TIMESTAMP_COMPLETION = (1 << 0)
> };
>
> /**
>  * ibv_get_timestamp - Return the requested timestamp for the given wc
>  * @wc - work completion to get timestamp results from
>  * @ts - struct timespec to return timestamp in
>  * @flags - which timestamp to return and in what form
>  *
>  * Depending on the flags used to create the queue pair/completion
>  * queue, different timestamps might be available.  Callers should
>  * specify which timestamp they are interested in using the flags
>  * element, and if they wish either a cooked or raw timestamp.  A
>  * raw timestamp is implementation defined and will be passed back
>  * in the tv_nsec portion of the struct timespec.  A raw timestamp
>  * can not be relied upon to have any ordering value between more
>  * than one HCA or driver.  A cooked timestamp will return a valid
>  * struct timespec normalized as closely as possible to the return
>  * value for CLOCK_MONOTONIC of clock_gettime at the time of the
>  * timestamp.
>  */
> int ibv_get_timestamp(struct ibv_wc *wc, struct timespec *ts, int
> flags);
>

We wanted to divide the flow here:
In create_cq, the user notifies the kernel/HCA which timestamp he
would like to get.
It could be a completion timestamp, a start of WQE timestamp or
whatever he wants.
The timestamp the user gets in the WQE is opaque. Every vendor could
implement it
as it wants - in order to have minimal implication in performance.

The second part is ibv_get_timestamp. It gets an opaque timestamp and
outputs a converted value in respect to the time the user wanted to get.
For example, if IB_TIMESTAMP_NS_TIME is given, the function should output
a system-wide NS value (we would like to implement this only in the future).
Currently, only RAW and RAW_NS will be supported, while RAW gives the time
in cycles and RAW_NS gives a NS value with an unknown time reference.

We think ibv_get_timestamp shouldn't get a wqe but a 64bit opaque value.
The reason for this is that it could be used in order to translate query_values
current time to different types of timestamp.
What do you think?

>> We think this would address the reviewer comments for the kernel submission.
>>
>> The user-space code is in (still uses IB_CQ_FLAGS_TIMESTAMP and miss the
>> conversion functions)
>>
>>  https://github.com/matanb10/libibverbs timestamp-v1
>>  https://github.com/matanb10/libmlx4 timestamp-v1
>>
>> Timestamping is used by applications in order to know when a WQE was
>> received/transmitted by the HW. The value is given is HCA hardware cycles,
>> but could be easily converted as the hardware's core clock frequecny is
>> available through extension of query device.
>>
>> Moreover, we add an ability to read the HCA's current clock. This could be
>> useful on order to synchronize events to the wall clock.
>>
>> This functionality is achieved by adding/extending the following verbs:
>>
>> create_cq - create_cq is extended in order to allow passing creation flags
>> to the CQ creation function. We change IB/core --> vendors API
>> to be easily extendible by passing a struct which contains
>> comp_vectors, cqe and the new flags parameter. In order to create
>> CQ which supports timestamping, IB_CQ_FLAGS_TIMESTAMP_COMPLETION should be 
>> given.
>>
>> query_device - We extend query_device uverb further by giving the hardware's
>> clock frequency and the timestamp mask (the number of timestamp
>> bits which are supported). If timestamp isn't supported, 0 is returned.
>>
>> In order to read the timestamp in the WQE, the user needs to query the device
>> for support, create an appropriate CQ (using the extanded uverb with
>> IB_CQ_FLAGS_TIMESTAMP_COMPLETION) and poll the CQ with an extended poll_cq 
>> verb (currently,
>> only implemented in user-space).
>>
>> In mlx4, allowing the user to read the core clock efficiently involves 
>> mapping
>> this area of the hardware to user-space (being done by using a mmap command)
>> and reading the clock from the correct offset of the page.
>>
>> This offset is returned in the vendor's specific data from mlx4's kernel 
>> driver
>> to the mlx4's user-space driver. query_device is modified in order to support
>> passing this vendor specific data. A user-space application could use a new
>> verb in order to read the hardware's clock.
>>
>> Translating the hardware's clock into ms could be done by dividing this
>> value by hca_core_clock (which is returned by the extended version of
>> query_device uverb).
>>
>> A user-space application could get the current HW's clock by executing
>>
>> ibv_query_values_ex(struct ibv_context *context, uint32_t q_values,
>>                     struct ibv_values_ex *values)
>>
>> The function gets a mask of the values to query and return their values.
>> Vendors could either implement this as a uverb command or use their
>> user-space driver to return those values directly from the HW (the mlx4 way).
>>
>> Matan and Or.
>>
>> Changes from V1:
>>  (1) fixed lustre IB's code build
>>  (2) squashed mlx4 V1 9-11 patches into one
>>  (3) changed IB_CQ_FLAGS_TIMESTAMP --> IB_CQ_FLAGS_TIMESTAMP_COMPLETION
>>
>> Changes from V0:
>> (1) Pass ib_cq_init_attr instead of cqe and comp_vector.
>> (2) Fix unneeded indentation.
>> (3) Change flags to u32.
>> (4) Add const to create_cq's ib_cq_init_attr argument in vendor 
>> implementation.
>>
>> Matan Barak (9):
>>   IB/core: Change provider's API of create_cq to be extendible
>>   IB/core: Change ib_create_cq to use struct ib_cq_init_attr
>>   IB/core: Add CQ creation time-stamping flag
>>   IB/core: Extend ib_uverbs_create_cq
>>   IB/core: Add timestamp_mask and hca_core_clock to query_device
>>   IB/core: Pass hardware specific data in query_device
>>   IB/mlx4: Add mmap call to map the hardware clock
>>   IB/mlx4: Support extended create_cq and query_device uverbs
>>   IB/mlx4: Add support for CQ time-stamping
>>
>>  drivers/infiniband/core/device.c                   |    6 +-
>>  drivers/infiniband/core/mad.c                      |    5 +-
>>  drivers/infiniband/core/uverbs.h                   |    1 +
>>  drivers/infiniband/core/uverbs_cmd.c               |  188 
>> ++++++++++++++++----
>>  drivers/infiniband/core/uverbs_main.c              |    1 +
>>  drivers/infiniband/core/verbs.c                    |    4 +-
>>  drivers/infiniband/hw/amso1100/c2_provider.c       |   14 ++-
>>  drivers/infiniband/hw/cxgb3/iwch_provider.c        |   19 ++-
>>  drivers/infiniband/hw/cxgb4/cq.c                   |    9 +-
>>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h             |    8 +-
>>  drivers/infiniband/hw/cxgb4/provider.c             |    8 +-
>>  drivers/infiniband/hw/ehca/ehca_cq.c               |    7 +-
>>  drivers/infiniband/hw/ehca/ehca_hca.c              |    6 +-
>>  drivers/infiniband/hw/ehca/ehca_iverbs.h           |    6 +-
>>  drivers/infiniband/hw/ehca/ehca_main.c             |    6 +-
>>  drivers/infiniband/hw/ipath/ipath_cq.c             |    9 +-
>>  drivers/infiniband/hw/ipath/ipath_verbs.c          |    7 +-
>>  drivers/infiniband/hw/ipath/ipath_verbs.h          |    3 +-
>>  drivers/infiniband/hw/mlx4/cq.c                    |   13 ++-
>>  drivers/infiniband/hw/mlx4/mad.c                   |    5 +-
>>  drivers/infiniband/hw/mlx4/main.c                  |   67 +++++++-
>>  drivers/infiniband/hw/mlx4/mlx4_ib.h               |   19 ++-
>>  drivers/infiniband/hw/mlx5/cq.c                    |   10 +-
>>  drivers/infiniband/hw/mlx5/main.c                  |   19 ++-
>>  drivers/infiniband/hw/mlx5/mlx5_ib.h               |    5 +-
>>  drivers/infiniband/hw/mthca/mthca_provider.c       |   15 ++-
>>  drivers/infiniband/hw/nes/nes_verbs.c              |   17 ++-
>>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c        |   13 ++-
>>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.h        |    9 +-
>>  drivers/infiniband/hw/qib/qib_cq.c                 |   11 +-
>>  drivers/infiniband/hw/qib/qib_verbs.c              |    6 +-
>>  drivers/infiniband/hw/qib/qib_verbs.h              |    5 +-
>>  drivers/infiniband/hw/usnic/usnic_ib_verbs.c       |   16 ++-
>>  drivers/infiniband/hw/usnic/usnic_ib_verbs.h       |   10 +-
>>  drivers/infiniband/ulp/ipoib/ipoib_verbs.c         |    9 +-
>>  drivers/infiniband/ulp/iser/iser_verbs.c           |    6 +-
>>  drivers/infiniband/ulp/isert/ib_isert.c            |    6 +-
>>  drivers/infiniband/ulp/srp/ib_srp.c                |   10 +-
>>  drivers/infiniband/ulp/srpt/ib_srpt.c              |    5 +-
>>  drivers/net/ethernet/mellanox/mlx4/main.c          |   19 ++
>>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |    7 +-
>>  include/linux/mlx4/device.h                        |    9 +
>>  include/rdma/ib_verbs.h                            |   25 ++-
>>  include/uapi/rdma/ib_user_verbs.h                  |   19 ++
>>  net/9p/trans_rdma.c                                |    5 +-
>>  net/rds/ib_cm.c                                    |    8 +-
>>  net/rds/iw_cm.c                                    |    8 +-
>>  net/sunrpc/xprtrdma/svc_rdma_transport.c           |   10 +-
>>  net/sunrpc/xprtrdma/verbs.c                        |   10 +-
>>  49 files changed, 564 insertions(+), 139 deletions(-)
>>
>
>
> --
> Doug Ledford <[email protected]>
>               GPG KeyID: 0E572FDD
>

Thanks for taking a look.

Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to