Hi Jeff,
> Does your llp send path follow MPI matching ordering? E.g., if some prior isend
> is already queued, could the llp send overtake it?
Yes, an LLP send may overtake a queued isend.
But we set the correct PML send_sequence in the header, so the receiver
can still match messages in MPI order; the early LLP message is simply
queued on the receiver side (as an unexpected or out-of-order message)
until its turn comes. I think this is not a problem.
> > rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
> >                        (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
> >                        ompi_comm_peer_lookup(comm, dst),
> >                        MCA_PML_OB1_HDR_TYPE_MATCH));
> >
> > if (rc == OMPI_SUCCESS) {
> >     /* NOTE this is not thread safe */
> >     OPAL_THREAD_ADD32(&proc->send_sequence, 1);
> > }
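
To illustrate, here is a simplified sketch of the receiver-side ordering
(illustrative names only, not the actual ob1 matching code):

------------------------------------------------

/* Simplified sketch of sequence-based ordering on the receive side.
 * Names are illustrative; this is not the actual ob1 code. */
#include <stddef.h>
#include <stdint.h>

typedef struct sketch_frag {
    uint16_t hdr_seq;           /* sequence number from the match header */
    struct sketch_frag *next;   /* link for the out-of-order list        */
} sketch_frag_t;

typedef struct {
    uint16_t expected_seq;      /* next sequence number we may match */
    sketch_frag_t *cant_match;  /* fragments that arrived too early  */
} sketch_proc_t;

/* hypothetical: match against a posted receive or queue as unexpected */
void sketch_match_or_queue_unexpected(sketch_frag_t *frag);

void sketch_handle_match_hdr(sketch_proc_t *proc, sketch_frag_t *frag)
{
    if (frag->hdr_seq != proc->expected_seq) {
        /* an earlier send (e.g. a queued isend) has not arrived yet;
         * hold this fragment until the gap is filled */
        frag->next = proc->cant_match;
        proc->cant_match = frag;
        return;
    }

    for (;;) {
        /* in order: match it or queue it as an unexpected message */
        proc->expected_seq++;
        sketch_match_or_queue_unexpected(frag);

        /* a previously held fragment may now be in order */
        sketch_frag_t **p = &proc->cant_match;
        while (NULL != *p && (*p)->hdr_seq != proc->expected_seq) {
            p = &(*p)->next;
        }
        if (NULL == *p) {
            break;
        }
        frag = *p;
        *p = frag->next;
    }
}

------------------------------------------------
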
Takahiro Kawashima,
MPI development team,
Fujitsu
> Does your llp send path follow MPI matching ordering? E.g., if some prior isend
> is already queued, could the llp send overtake it?
>
> Sent from my phone. No type good.
>
> On Jun 29, 2011, at 8:27 AM, "Kawashima" <[email protected]> wrote:
>
> > Hi Jeff,
> >
> >>> First, we created a new BTL component, 'tofu BTL'. It's nothing special,
> >>> just dedicated to our Tofu interconnect. But its latency was not
> >>> sufficient for us.
> >>>
> >>> So we created a new framework, 'LLP', and its component, 'tofu LLP'.
> >>> It bypasses request object creation in the PML and BML/BTL, and sends
> >>> a message immediately if possible.
> >>
> >> Gotcha. Was the sendi pml call not sufficient? (sendi = "send
> >> immediate") This call was designed to be part of a latency reduction
> >> mechanism. I forget offhand what we don't do before calling sendi, but
> >> the rationale was that if the message was small enough, we could skip some
> >> steps in the sending process and "just send it."
> >
> > I know sendi, but its latency was still not sufficient for us.
> > Before the sendi call is reached, we must:
> > - allocate a send request (MCA_PML_OB1_SEND_REQUEST_ALLOC)
> > - initialize the send request (MCA_PML_OB1_SEND_REQUEST_INIT)
> > - select a BTL module (mca_pml_ob1_send_request_start)
> > - select a protocol (mca_pml_ob1_send_request_start_btl)
> > We want to eliminate these overheads and send even more immediately.
> >
> > Here is a code snippet:
> >
> > ------------------------------------------------
> >
> > #if OMPI_ENABLE_LLP
> > static inline int mca_pml_ob1_call_llp_send(void *buf,
> >                                             size_t size,
> >                                             int dst,
> >                                             int tag,
> >                                             ompi_communicator_t *comm)
> > {
> >     int rc;
> >     mca_pml_ob1_comm_proc_t *proc = &comm->c_pml_comm->procs[dst];
> >     mca_pml_ob1_match_hdr_t *match = mca_pml_ob1.llp_send_buf;
> >
> >     match->hdr_common.hdr_type = MCA_PML_OB1_HDR_TYPE_MATCH;
> >     match->hdr_common.hdr_flags = 0;
> >     match->hdr_ctx = comm->c_contextid;
> >     match->hdr_src = comm->c_my_rank;
> >     match->hdr_tag = tag;
> >     match->hdr_seq = proc->send_sequence + 1;
> >
> >     rc = MCA_LLP_CALL(send(buf, size, OMPI_PML_OB1_MATCH_HDR_LEN,
> >                            (bool)OMPI_ENABLE_OB1_PAD_MATCH_HDR,
> >                            ompi_comm_peer_lookup(comm, dst),
> >                            MCA_PML_OB1_HDR_TYPE_MATCH));
> >
> >     if (rc == OMPI_SUCCESS) {
> >         /* NOTE this is not thread safe */
> >         OPAL_THREAD_ADD32(&proc->send_sequence, 1);
> >     }
> >
> >     return rc;
> > }
> > #endif
> >
> > int mca_pml_ob1_send(void *buf,
> >                      size_t count,
> >                      ompi_datatype_t * datatype,
> >                      int dst,
> >                      int tag,
> >                      mca_pml_base_send_mode_t sendmode,
> >                      ompi_communicator_t * comm)
> > {
> >     int rc;
> >     mca_pml_ob1_send_request_t *sendreq;
> >
> > #if OMPI_ENABLE_LLP
> >     /* try to send message via LLP if
> >      *   - one of LLP modules is available, and
> >      *   - datatype is basic, and
> >      *   - data is small, and
> >      *   - communication mode is standard, buffered, or ready, and
> >      *   - destination is not myself
> >      */
> >     if (((datatype->flags & DT_FLAG_BASIC) == DT_FLAG_BASIC) &&
> >         (datatype->size * count < mca_pml_ob1.llp_max_payload_size) &&
> >         (sendmode == MCA_PML_BASE_SEND_STANDARD ||
> >          sendmode == MCA_PML_BASE_SEND_BUFFERED ||
> >          sendmode == MCA_PML_BASE_SEND_READY) &&
> >         (dst != comm->c_my_rank)) {
> >         rc = mca_pml_ob1_call_llp_send(buf, datatype->size * count, dst,
> >                                        tag, comm);
> >         if (rc != OMPI_ERR_NOT_AVAILABLE) {
> >             /* successfully sent out via LLP or unrecoverable error occurred */
> >             return rc;
> >         }
> >     }
> > #endif
> >
> >     MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq, rc);
> >     if (rc != OMPI_SUCCESS)
> >         return rc;
> >
> >     MCA_PML_OB1_SEND_REQUEST_INIT(sendreq,
> >                                   buf,
> >                                   count,
> >                                   datatype,
> >                                   dst, tag,
> >                                   comm, sendmode, false);
> >
> >     PERUSE_TRACE_COMM_EVENT (PERUSE_COMM_REQ_ACTIVATE,
> >                              &(sendreq)->req_send.req_base,
> >                              PERUSE_SEND);
> >
> >     MCA_PML_OB1_SEND_REQUEST_START(sendreq, rc);
> >     if (rc != OMPI_SUCCESS) {
> >         MCA_PML_OB1_SEND_REQUEST_RETURN( sendreq );
> >         return rc;
> >     }
> >
> >     ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);
> >
> >     rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
> >     ompi_request_free( (ompi_request_t**)&sendreq );
> >     return rc;
> > }
> >
> > ------------------------------------------------
> >
> > mca_pml_ob1_send is the body of MPI_Send in Open MPI. The region guarded
> > by OMPI_ENABLE_LLP is what we added.
> >
> > We don't need a send request at all if we can "send immediately", so we
> > try the LLP first. If the LLP cannot send immediately, e.g. because the
> > interconnect is busy, it returns OMPI_ERR_NOT_AVAILABLE and we continue
> > with the normal PML/BML/BTL send(i) path. Since we want to use a simple
> > memcpy instead of the complex convertor, we restrict the datatypes that
> > may go through the LLP.
> >
> > Of course, we cannot use LLP for MPI_Isend.
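> >
> > Just to illustrate the idea (a rough, hypothetical sketch; llp_try_acquire_slot()
> > and llp_kick() are made-up names, not our actual tofu LLP code), an LLP send
> > boils down to:
> >
> > ------------------------------------------------
> >
> > #include <string.h>
> > #include "ompi/constants.h"
> > #include "ompi/proc/proc.h"
> >
> > /* hypothetical helpers (made-up names, not a real tofu LLP API):
> >  * grab a free hardware send slot without blocking / trigger the send */
> > void *llp_try_acquire_slot(ompi_proc_t *peer);
> > void  llp_kick(ompi_proc_t *peer, void *slot, size_t len);
> >
> > /* copy header + small contiguous payload into a pre-registered
> >  * interconnect buffer, or give up immediately */
> > static int llp_sketch_send(void *hdr, size_t hdr_len,
> >                            void *payload, size_t payload_len,
> >                            ompi_proc_t *peer)
> > {
> >     void *slot = llp_try_acquire_slot(peer);
> >
> >     if (NULL == slot) {
> >         /* interconnect busy: caller falls back to the normal PML path */
> >         return OMPI_ERR_NOT_AVAILABLE;
> >     }
> >
> >     /* the datatype is restricted to basic/contiguous ones, so a plain
> >      * memcpy is enough and no convertor is needed */
> >     memcpy(slot, hdr, hdr_len);
> >     memcpy((char *)slot + hdr_len, payload, payload_len);
> >
> >     llp_kick(peer, slot, hdr_len + payload_len);
> >     return OMPI_SUCCESS;
> > }
> >
> > ------------------------------------------------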
> >
> >> Note, too, that the coll modules can be laid overtop of each other --
> >> e.g., if you only implement barrier (and some others) in tofu coll, then
> >> you can supply NULL for the other function pointers and the coll base will
> >> resolve those functions to other coll modules automatically.
> >
> > Thanks for the info. I've read mca_coll_base_comm_select() and now
> > understand it. Our implementation was poorly done.
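> >
> > For example, if I understand correctly, a coll module that provides only
> > barrier could look roughly like this (a sketch from memory; the exact
> > type and field names may not be right, and the 'tofu' names are just
> > placeholders):
> >
> > ------------------------------------------------
> >
> > #include "ompi_config.h"
> > #include "ompi/constants.h"
> > #include "ompi/communicator/communicator.h"
> > #include "ompi/mca/coll/coll.h"
> >
> > static int tofu_module_enable(mca_coll_base_module_t *module,
> >                               struct ompi_communicator_t *comm)
> > {
> >     return OMPI_SUCCESS;
> > }
> >
> > static int tofu_barrier(struct ompi_communicator_t *comm,
> >                         mca_coll_base_module_t *module)
> > {
> >     /* ... hardware barrier on the Tofu interconnect ... */
> >     return OMPI_SUCCESS;
> > }
> >
> > mca_coll_base_module_t *
> > mca_coll_tofu_comm_query(struct ompi_communicator_t *comm, int *priority)
> > {
> >     mca_coll_base_module_t *module = OBJ_NEW(mca_coll_base_module_t);
> >
> >     *priority = 50;                       /* illustrative value */
> >     module->coll_module_enable = tofu_module_enable;
> >     module->coll_barrier = tofu_barrier;  /* only barrier is provided */
> >     /* coll_bcast, coll_allreduce, etc. stay NULL, so
> >      * mca_coll_base_comm_select() resolves them from other coll modules */
> >     return module;
> > }
> >
> > ------------------------------------------------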