On Tue, 8 Nov 2011 06:36:03 -0800, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>> george.
>>
>> PS: Regarding the hand-copy instead of the memcpy, we tried to avoid using
>> memcpy in performance critical codes, especially when we know the size of
>> the data and the alignment. This relieves the compiler of adding ugly intrinsics,
>> allowing it to nicely pipeline to load/stores. Anyway, with both approaches
>> you will copy more data than needed for all BTLs except uGNI.
>
> I was looking at a case in a BTL I was working on where I actually need 64
> bytes (yes, bytes) as the remote key size as opposed to the current 16
> bytes (128 bits).
> Not sure how I can handle that yet. (I assume configure is my friend, but
> even in that case, all headers will need to carry around the extra data.)
>

I have been thinking about this a little bit. What I think should be done
(and I am sure George will disagree) is to allow each BTL to define how long
a segment is. The PML would then just memcpy the segments into the send
buffer (instead of copying each member).

For example, mca_btl_base_segment_t would become:

    struct mca_btl_base_segment_t {
        size_t seg_len;
    };

since the PML needs the segment size (it does not need anything else).

Each BTL would then define its own segment like:

    struct mca_btl_ugni_segment_t {
        struct mca_btl_base_segment_t base;
        gni_mem_handle_t seg_key;
    };

and we would add:

    size_t btl_segment_len;

to mca_btl_base_module_t or to the base frag, so the PML knows how much it
needs to copy.

This design would address George's criticism of the length of the seg_key
and also allow BTLs to do what they need to. It would require a memcpy, but
I disagree that this would slow the critical path. Even if it does, the cost
would be relatively minor (I think), and the flexibility is worth more in
the long run.

-Nathan
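
To make the proposal concrete, here is a minimal sketch in C of how the PML side could treat segments as opaque blobs of btl_segment_len bytes. Only mca_btl_base_segment_t, mca_btl_ugni_segment_t, seg_len, seg_key and btl_segment_len come from the text above. Everything prefixed with example_ is hypothetical: the cut-down module struct, the two helper functions, and the two-word stand-in for gni_mem_handle_t (the real type comes from the uGNI headers). This is an illustration under those assumptions, not the actual Open MPI code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Base segment as proposed: the PML only needs the length. */
struct mca_btl_base_segment_t {
    size_t seg_len;
};

/* Stand-in for gni_mem_handle_t (assumption: 16 bytes, two 64-bit words). */
typedef struct {
    uint64_t qword1;
    uint64_t qword2;
} example_gni_mem_handle_t;

/* BTL-specific segment: the base comes first, BTL-private data follows. */
struct mca_btl_ugni_segment_t {
    struct mca_btl_base_segment_t base;
    example_gni_mem_handle_t seg_key;
};

/* Hypothetical cut-down module carrying only the proposed field. */
struct example_btl_module_t {
    size_t btl_segment_len; /* size of one BTL-specific segment in bytes */
};

/* Hypothetical PML helper: copy seg_count opaque segments into the send
 * buffer with a single memcpy. Returns the number of bytes written, or 0
 * if the segments do not fit. */
static size_t example_pml_pack_segments(const struct example_btl_module_t *btl,
                                        const void *segments, size_t seg_count,
                                        void *send_buf, size_t buf_len)
{
    if (seg_count != 0 && btl->btl_segment_len > buf_len / seg_count) {
        return 0; /* would overflow the send buffer */
    }
    size_t total = btl->btl_segment_len * seg_count;
    memcpy(send_buf, segments, total);
    return total;
}

/* Hypothetical PML helper: read seg_len of the index-th packed segment
 * without knowing the BTL-specific layout. memcpy is used for the read
 * because the packed buffer may not be suitably aligned. */
static size_t example_pml_segment_len(const struct example_btl_module_t *btl,
                                      const void *packed, size_t index)
{
    struct mca_btl_base_segment_t base;
    assert(btl->btl_segment_len >= sizeof(base));
    memcpy(&base,
           (const unsigned char *)packed + index * btl->btl_segment_len,
           sizeof(base));
    return base.seg_len;
}
```

The point of the layout is that the PML never looks past the base member: it strides through the packed buffer using btl_segment_len and reads only seg_len, so a BTL that needs a 64-byte key just reports a larger btl_segment_len.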