On Oct 18, 2006, at 3:57 PM, Pete Wyckoff wrote:

[EMAIL PROTECTED] wrote on Wed, 18 Oct 2006 13:38 -0500:
We do what seems like unnecessary conversion from the output of
PINT_process_request (offsets and sizes into a buffer) to the input
of BMI_post_send_list (array of memory pointers).  Could we change
the BMI_post_send_list interface to take offsets instead?

We need the list of buffers in case post_send_list wants to use bits
of memory from separate allocations.  But we could also pass in the
"parent buffer" if one exists, or all of them if multiple ones
exist.  This gets messy fast.
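
For reference, a rough sketch of the conversion in question — turning the offset/size output of PINT_process_request into the pointer list that BMI_post_send_list expects. Names and types here are illustrative, not the actual PVFS2 declarations, and it assumes a single parent buffer:

```c
/* Illustrative sketch only: the per-segment translation from
 * offsets within one parent buffer into the pointer list that
 * BMI_post_send_list takes.  Names/types are hypothetical. */
#include <stdint.h>

static void offsets_to_pointers(void *parent_buffer,
                                const int64_t *offsets,
                                const int64_t *sizes,
                                int count,
                                void **buffer_list,
                                int64_t *size_list)
{
    int i;
    for (i = 0; i < count; i++) {
        /* each list entry is just parent + offset */
        buffer_list[i] = (char *) parent_buffer + offsets[i];
        size_list[i] = sizes[i];
    }
}
```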

Another way to do it would be for the flow code to make a separate
call to BMI to say "here's the big buffer I'm using".  But then to
be completely correct it should probably say "I'm done with this big
buffer" later.

It's all a massive layering violation.  Calls out for some sort of
unified approach to buffer management that spans layers.

Here's what I'm sort of thinking now, but I'm not sure.  It kind of
curdles the stomach contents.  Tell me if you think of a better way.

Declare this:

        struct bmi_optimistic_buffer_info {
            const void *buffer;
            bmi_size_t len;
            enum PVFS_io_type rw;
        };

and hang one off the sm u.io.  Initialize in PVFS_isys_io by:

        sm_p->u.io.binfo.buffer = sm_p->u.io.buffer;
        sm_p->u.io.binfo.len = PINT_REQUEST_TOTAL_BYTES(sm_p->u.io.mem_req);
        sm_p->u.io.binfo.rw = sm_p->u.io.io_type;

Then in io_post_flow() somewhere just before the job_flow():

        BMI_set_info(cur_ctx->msg.svr_addr, BMI_OPTIMISTIC_BUFFER_REG,
                     &sm_p->u.io.binfo);

(called once for each server), and deep in
io_datafile_complete_operations(), near the IO_SM_PHASE_FLOW
handling, undo it for that particular server with:

        BMI_set_info(cur_ctx->msg.svr_addr, BMI_OPTIMISTIC_BUFFER_DEREG,
                     &sm_p->u.io.binfo);

Under the hood, BMI can of course choose to preserve its actual
registration beyond the flow's lifetime.  The main benefit of doing
this is that BMI can recognize that a particular buffer is part of
the bigger registration, and so avoid making 900 separate little
64 kB registrations in favor of a single 36 MB registration.
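
The "recognize" step could be as simple as a range check against the registered parent buffers before falling back to a per-segment registration. A sketch, with hypothetical structures (the real registration cache would presumably hash or sort by address):

```c
/* Hypothetical sketch of the lookup BMI could do under the hood:
 * before registering a small segment with the NIC, check whether
 * it falls inside a buffer already registered via
 * BMI_OPTIMISTIC_BUFFER_REG. */
#include <stddef.h>

struct reg_entry {
    const char *base;
    size_t len;
    void *mem_handle;          /* NIC registration handle */
    struct reg_entry *next;
};

static void *find_parent_registration(struct reg_entry *regs,
                                      const void *buf, size_t len)
{
    const char *p = buf;
    struct reg_entry *r;
    for (r = regs; r; r = r->next) {
        if (p >= r->base && p + len <= r->base + r->len)
            return r->mem_handle;  /* reuse the big registration */
    }
    return NULL;  /* fall back to a per-segment registration */
}
```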

Rob pointed out that even with one buffer pointer passed into
PVFS_sys_io, if the request is non-contiguous, the offsets of the
request could have been calculated from different buffer pointers.
So we don't know how many separate buffers are being used over the
memory request, except that there are at most as many buffers as
contiguous regions in the request.

The PINT_process_request code (potentially) breaks those buffers up
even further, based on the distribution parameters (like strip
size), before passing the pointers to BMI.  It seems like we could
do what you're suggesting, but we would have to do it per contiguous
region of the request.  Maybe that's not such a big deal?  Not
sure...
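
One way to recover those regions from the segments PINT_process_request hands us would be to coalesce address-adjacent segments back into maximal contiguous runs, and issue one registration hint per run. A hypothetical sketch (assumes the segments arrive in address order, which may not hold in practice):

```c
/* Hypothetical sketch: merge the small segments produced by
 * PINT_process_request back into maximal contiguous regions, so
 * that one BMI_OPTIMISTIC_BUFFER_REG hint could be issued per
 * region rather than per segment.  Assumes address order. */
#include <stdint.h>

struct region {
    const char *base;
    int64_t len;
};

/* returns the number of merged regions written to 'out' */
static int coalesce_segments(const char **seg_ptrs,
                             const int64_t *seg_lens,
                             int nsegs,
                             struct region *out)
{
    int i, nout = 0;
    for (i = 0; i < nsegs; i++) {
        if (nout > 0 &&
            out[nout - 1].base + out[nout - 1].len == seg_ptrs[i]) {
            /* segment extends the previous region */
            out[nout - 1].len += seg_lens[i];
        } else {
            out[nout].base = seg_ptrs[i];
            out[nout].len = seg_lens[i];
            nout++;
        }
    }
    return nout;
}
```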

-sam



                -- Pete


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers