[ https://issues.apache.org/jira/browse/ARROW-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758270#comment-16758270 ]

Pearu Peterson commented on ARROW-2447:
---------------------------------------

[~wesmckinn], [~pitrou], and others interested in this:

I'd like to revive this issue and discuss the memory buffer model on 
heterogeneous systems where device memory can be accessed only by copying 
its contents. Apart from this restriction, one would wish to treat device 
memory buffers in the same way as host memory buffers. That is, operations 
and algorithms developed for host memory buffers should be applicable to 
device memory buffers without any extra work.

So, let me lay out some of the key features and issues of this problem as 
follows.

In CUDA, one can allocate memory that is accessible both from host and device 
without explicit copying (a minimal sketch of all three paths follows below). 
This includes:
(i) managed memory (allocated using `cudaMallocManaged`), where the device 
driver handles the copying on demand, triggered by page faults;
(ii) pinned host memory (allocated using `cudaMallocHost`), where the host 
memory is accessible from a device via DMA;
(iii) host memory (allocated using `malloc` or `new`) that is page-locked 
using `cudaHostRegister`; again, the host memory is accessible from a device 
via DMA.
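
For concreteness, here is a minimal sketch of the three allocation paths 
using the plain CUDA runtime API (error checking omitted for brevity):

{code}
// Minimal sketch of the three host/device-accessible allocation paths.
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
  size_t nbytes = 1 << 20;

  // (i) managed memory: the driver migrates pages on demand (page faults)
  void* managed = nullptr;
  cudaMallocManaged(&managed, nbytes);

  // (ii) pinned host memory: the device reads/writes it via DMA
  void* pinned = nullptr;
  cudaMallocHost(&pinned, nbytes);

  // (iii) ordinary host memory, page-locked after the fact
  void* plain = std::malloc(nbytes);
  cudaHostRegister(plain, nbytes, cudaHostRegisterDefault);

  cudaHostUnregister(plain);
  std::free(plain);
  cudaFreeHost(pinned);
  cudaFree(managed);
  return 0;
}
{code}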

DEFINITION: The notion "accessible from a device" (CPU or GPU) means that, 
given a memory pointer to data, one can read and write the data by pointer 
dereferencing within the device process. This is possible when host and 
device memories use the same virtual memory area (VMA), or when there exists 
a mapping between host and device VMAs such that access requires only a 
pointer value transformation (but no data copying), using 
`cudaHostGetDevicePointer`, for instance.

On the other hand, one can allocate memory that is accessible from only one 
side:
(iv) host memory (allocated using `malloc` or `new`) is generally not 
accessible from a device;
(v) device memory (allocated using `cudaMalloc`) is generally not accessible 
from the host.

Finally, there is also a need to provide a way for one device to access the 
memory of another device. While CUDA provides various (copy or no-copy) 
methods for establishing access between different GPU devices, there exist 
other accelerators (FPGAs, etc.) whose memory buffers would also need to be 
made accessible to other devices. When a direct connection between two 
devices is missing, host RAM can be used as an intermediate copy buffer.
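
For GPU-to-GPU access, for instance, CUDA offers peer access; a hedged 
sketch with a copy-based fallback (helper name is illustrative):

{code}
// Hedged sketch: enabling no-copy peer access between two GPUs, falling
// back to copying through host RAM when peer access is unavailable.
#include <cuda_runtime.h>

bool enable_peer_access(int dev, int peer) {
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, dev, peer);
  if (!can_access) return false;        // caller must copy via host RAM
  cudaSetDevice(dev);
  cudaDeviceEnablePeerAccess(peer, 0);  // flags must be 0 (reserved)
  return true;
}
{code}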

So, my first suggestion is to revise the title of this issue, "Create a 
device abstraction", as well as the proposed Device abstraction semantics, 
because these consider only the cases (iv) and (v), while the Device 
abstraction would suggest suboptimal usage for the cases (i), (ii), and 
(iii). For instance, managed memory could be held both by a `Buffer` and by 
a `CudaBuffer`, while in the latter case unnecessary copies will be made by 
algorithms that need to access the `CudaBuffer` memory from a host process.

The memory buffer abstraction should capture the following data flow cases 
(a usage sketch follows below):
(A) if the buffer data is accessible within a device process (using the 
pointer value or its transformed value), then the process will interpret the 
buffer data pointer as a device pointer;
(B) if the buffer data is not accessible within a device process, then the 
process or the buffer object needs to implement a copy method. The copy 
method would involve a data copy as well as memory management of a temporary 
buffer.
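
As a usage sketch, assuming the generalized `Buffer`/`Device` interface 
proposed below, and with hypothetical `AllocateOn` and `size()` helpers, the 
two cases could look like:

{code}
// Hedged sketch of data flow cases (A) and (B); names are illustrative.
const uint8_t* get_data(const Buffer& buf, const Device& dev,
                        std::shared_ptr<Buffer>* tmp) {
  if (buf.is_accessible(dev)) {
    // Case (A): reinterpret the (possibly transformed) pointer, no copy.
    return buf.accessible_data(dev);
  }
  // Case (B): copy into a temporary buffer that dev can reach.
  *tmp = AllocateOn(dev, buf.size());  // hypothetical allocation helper
  buf.CopyTo(*tmp);
  return (*tmp)->accessible_data(dev);
}
{code}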

The proposed Device abstraction already involves components for both (A) and 
(B), but I can see that it could be generalized to arbitrary devices and 
provide an optimal abstraction for data access and movement (see the sketch 
after this list). For instance:
1. replace `cpu_data()` and `on_cpu` with `accessible_data(const Device& 
other)` and `is_accessible(const Device& other)`, respectively;
2. replace `CopyToCpu` and `CopyFromCpu` with `CopyTo` and `CopyFrom`, 
respectively;
3. instead of `Buffer`, `CudaBuffer`, `FPGABuffer`, etc., just have `Buffer`;
4. internally, use `uintptr_t` to hold the buffer pointer value instead of 
`uint8_t*`, to prevent accidental dereferencing when the pointer is not 
accessible from the given device; `accessible_data` can still return 
`uint8_t*`.
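
Put together, a minimal sketch of points 1-4 (names are illustrative, not an 
agreed-upon Arrow API):

{code}
// Hedged sketch of the generalized interface from points 1-4 above.
#include <cstdint>
#include <memory>

class Device;  // describes a "memory area" rather than a physical device

class Buffer {
 public:
  virtual ~Buffer() = default;

  // Point 4: the address is held as uintptr_t so it cannot be
  // dereferenced accidentally on a device that cannot reach it.
  uintptr_t address() const { return address_; }
  int64_t size() const { return size_; }

  // Point 1: accessibility is a relation between memory areas,
  // not a fixed "on CPU / not on CPU" flag.
  virtual bool is_accessible(const Device& other) const = 0;
  virtual const uint8_t* accessible_data(const Device& other) const = 0;

  // Point 2: symmetric copies between arbitrary devices.
  virtual void CopyTo(const std::shared_ptr<Buffer>& dest) const = 0;
  virtual void CopyFrom(const std::shared_ptr<Buffer>& src) = 0;

  // Point 3: this single class replaces Buffer, CudaBuffer, FPGABuffer, ...

 protected:
  uintptr_t address_ = 0;
  int64_t size_ = 0;
};
{code}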

At the moment, I don't have a good replacement name for `Device`, as its 
substance is not to specify a particular "device" but a "memory area". While 
in many cases these coincide, there are cases where a memory area can 
represent the memory of multiple devices (see cases (i)-(iii) above). 
Perhaps replace `Device` with `VirtualMemoryArea`? Or `MemoryPool`? Other 
ideas?

> [C++] Create a device abstraction
> ---------------------------------
>
>                 Key: ARROW-2447
>                 URL: https://issues.apache.org/jira/browse/ARROW-2447
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, GPU
>    Affects Versions: 0.9.0
>            Reporter: Antoine Pitrou
>            Priority: Major
>
> Right now, a plain Buffer doesn't carry information about where it actually 
> lies. That information also cannot be passed around, so you get APIs like 
> {{PlasmaClient}} which take or return device number integers, and have 
> implementations which hardcode operations on CUDA buffers. Also, unsuspecting 
> receivers of a {{Buffer}} pointer may try to act on the underlying memory 
> without knowing whether it's CPU-reachable or not.
> Here is a sketch for a proposed Device abstraction:
> {code}
> class Device {
>  public:
>     enum DeviceKind { KIND_CPU, KIND_CUDA };
>     virtual DeviceKind kind() const;
>     //MemoryPool* default_memory_pool() const;
>     //std::shared_ptr<Buffer> Allocate(...);
> };
> class CpuDevice : public Device {};
> class CudaDevice : public Device {
>  public:
>     int device_num() const;
> };
> class Buffer {
>  public:
>     virtual Device::DeviceKind device_kind() const;
>     virtual std::shared_ptr<Device> device() const;
>     virtual bool on_cpu() const {
>         return true;
>     }
>     const uint8_t* cpu_data() const {
>         return on_cpu() ? data() : nullptr;
>     }
>     uint8_t* cpu_mutable_data() {
>         return on_cpu() ? mutable_data() : nullptr;
>     }
>     virtual Status CopyToCpu(std::shared_ptr<Buffer> dest) const;
>     virtual Status CopyFromCpu(std::shared_ptr<Buffer> src);
> };
> class CudaBuffer : public Buffer {
>  public:
>     virtual bool on_cpu() const {
>         return false;
>     }
> };
> Status CopyBuffer(std::shared_ptr<Buffer> dest, const std::shared_ptr<Buffer>& src);
> {code}


