When you use the runtime API, you are using the C compiler to generate host 
code to populate the arrays.  The compiler knows the sizeof() of the array 
elements and the individual offsets of each field within a struct, because it 
is responsible for laying out the data structure in memory.  Part of what the 
compiler has to know is what the alignment restrictions are for the target 
architecture.  In the case of the runtime CUDA API and nvcc, the host and 
device compilers are in agreement about what these rules are, so that when data 
is copied from host to device by cudaMemcpy, the layout does not need to be 
altered in any way.  The compiler, on both sides, also ensures that pointer 
arithmetic uses the right offsets, which also requires knowing the sizeof() of the 
data type the pointer points at.  No explicit padding is ever required.

However, in PyCUDA, the Python side of things does not directly have access to 
the information in the C compiler, in particular the sizeof() operator or 
struct member offsets (which are computed at compile time).  The 
pycuda.characterize module provides a reimplementation of sizeof(), so you can 
be sure to allocate the right amount of memory and know the memory interval 
between array elements.  However, it does not compute the offsets of 
individual struct members, so you either have to compute those manually, the 
same way the compiler would, or follow the convention mentioned earlier of 
ordering struct members by descending size.  Inserting padding fields in a 
struct is just a helpful way to remind yourself where the gaps are in the 
struct layout so that you mimic them in your Python code.  They are not 
required if you are manually putting in correct offsets, using your knowledge 
of what the C compiler will do.
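For illustration, the offsets the C compiler would choose can be reproduced 
from Python with the standard ctypes module, which follows the host C 
compiler's alignment rules.  This is just a sketch with a made-up struct (not 
a PyCUDA API); the values in the comments assume a typical 64-bit platform:

```python
import ctypes

# Hypothetical struct, mirroring: struct rec { float *ptr; float x; int n; };
# ctypes applies the same member alignment rules as the host C compiler.
class Rec(ctypes.Structure):
    _fields_ = [("ptr", ctypes.POINTER(ctypes.c_float)),
                ("x",   ctypes.c_float),
                ("n",   ctypes.c_int)]

for name, _ in Rec._fields_:
    print(name, getattr(Rec, name).offset)  # ptr 0, x 8, n 12 on 64-bit
print("sizeof =", ctypes.sizeof(Rec))       # 16 on 64-bit
```

Note that ctypes reports the host layout; it agrees with the device layout 
only because the host and device compilers follow the same rules for these 
basic types.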


The two cases you compare (Vec and DoubleOperation) are actually structs with 
different sizes, where the only reason they have different sizes is the 
alignment requirement of the most strictly aligned member.  In Vec, that 
member is a pointer, which needs 8-byte alignment, so the struct is padded out 
to a multiple of 8 bytes.  In DoubleOperation, every member is a float, so 
4-byte alignment is sufficient and no padding is needed.  You can see how this 
works on the host using the following C++ program:

// alignment.cxx
#include <iostream>

struct threefloat {
  float x,y,z;
};

struct pointerint {
  float *a;
  int    n;
};

int main() {
  std::cout  << "sizeof(int) = " << sizeof(int) << std::endl;
  std::cout  << "sizeof(float) = " << sizeof(float) << std::endl;
  std::cout  << "sizeof(float*) = " << sizeof(float*) << std::endl;
  std::cout  << "sizeof(threefloat) = " << sizeof(threefloat) << std::endl;
  std::cout  << "sizeof(pointerint) = " << sizeof(pointerint) << std::endl;

  return 0;
}

On my 64-bit Linux system, the output of this program is:

sizeof(int) = 4
sizeof(float) = 4
sizeof(float*) = 8
sizeof(threefloat) = 12
sizeof(pointerint) = 16
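
As a cross-check, the same numbers can be reproduced from plain Python with 
the standard ctypes module, which lays structs out using the host C compiler's 
rules (a sketch; the commented values are from a 64-bit system, matching the 
output above):

```python
import ctypes

# Python mirrors of the two C structs above, laid out with host C alignment.
class ThreeFloat(ctypes.Structure):
    _fields_ = [("x", ctypes.c_float),
                ("y", ctypes.c_float),
                ("z", ctypes.c_float)]

class PointerInt(ctypes.Structure):
    _fields_ = [("a", ctypes.POINTER(ctypes.c_float)),
                ("n", ctypes.c_int)]

print("sizeof(threefloat) =", ctypes.sizeof(ThreeFloat))  # 12
print("sizeof(pointerint) =", ctypes.sizeof(PointerInt))  # 16 on 64-bit:
# 8 (pointer) + 4 (int) + 4 bytes of tail padding so arrays stay 8-aligned
```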


Basically, I've found that I stay out of alignment trouble in PyCUDA by 
following these two rules:

        1. Arrange members of structs in descending size order.
        2. Always use pycuda.characterize.sizeof() to compute the size of a 
struct.
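
One way to follow both rules when building the host-side buffer is a numpy 
structured dtype: with align=True, numpy inserts the same padding a C compiler 
would, and the resulting itemsize can be checked against 
pycuda.characterize.sizeof().  This is a sketch with a hypothetical Vec-style 
struct (a device pointer plus a length); the commented values assume a 64-bit 
host:

```python
import numpy as np

# Hypothetical Vec-style struct: { float *ptr; int datalen; }
# Rule 1: members in descending size order.  align=True makes numpy add
# C-compiler-style padding, including the tail padding after 'datalen'.
vec = np.dtype([("ptr", np.uintp),        # device pointer: 8 bytes on 64-bit
                ("datalen", np.int32)],   # 4 bytes
               align=True)

print(vec.itemsize)              # 16 on 64-bit (8 + 4 + 4 tail padding)
print(vec.fields["datalen"][1])  # offset of datalen: 8 on 64-bit
```

Rule 2 then amounts to checking that vec.itemsize equals what 
pycuda.characterize.sizeof() reports for the struct on the device.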


On Jan 21, 2012, at 2:06 PM, Yifei Li wrote:

> 1) The example fails to work without padding, the second block prints the 
> wrong 'datalen'.
> However, if I use CUDA runtime API instead of pycuda, the result is correct 
> even without the padding. 
>  
> The trick suggested by Stanley works. So let me answer this question myself: 
> the C compiler automatically figures out that the size of the struct should be 16 
> instead of 12, which matches the size of the struct on the device.  
> 
> Actually, I found that if the program is written using CUDA runtime API,  you 
> don't need to worry about alignment at all. For example, I tried several 
> structs with different sizes, and the values of the fields in a struct are 
> always printed correctly.
> 
> Why is that? This seems to contradict the following (quoted from Chapter 
> 5 of CUDA 4.0 programming guide):
> 
> A typical case where this might be easily overlooked is when using some 
> custom global memory allocation scheme, whereby the allocations of multiple 
> arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by 
> the allocation of a single large block of memory partitioned into multiple 
> arrays, in which case the starting address of each array is offset from the 
> block's starting address.  
> 
> 
> 2) Since the size of the struct without padding is 12 bytes, I tried a 
> different struct but of the same size:
> struct DoubleOperation {
>      float x;
>      float y;
>      float z;
> };
> And the kernel function is changed to 
>  __global__ void test(DoubleOperation *a) {
>         a = &a[blockIdx.x];
>         printf("block %d: %f %f %f\n", blockIdx.x, a->x, a->y, a->z);
>  }
> But this time the values of x, y and z are printed correctly by both blocks. 
> So why does it work even though the struct's size is the same as before?
> 
> However, I still don't have an answer for this.
> 
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda

