zeroshade commented on code in PR #41180:
URL: https://github.com/apache/arrow/pull/41180#discussion_r1567518245
##########
docs/source/format/DissociatedIPC.rst:
##########
@@ -0,0 +1,335 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _dissociated-ipc:
+
+========================
+Dissociated IPC Protocol
+========================
+
+.. warning::
+
+   Experimental: The Dissociated IPC Protocol is experimental in its current
+   form. Based on feedback and usage the protocol definition may change until
+   it is fully standardized.
+
+Rationale
+=========
+
+The :ref:`Arrow IPC format <format-ipc>` describes a protocol for transferring
+Arrow data as a stream of record batches. This protocol expects a continuous
+stream of bytes divided into discrete messages (using a length prefix and
+continuation indicator). Each discrete message consists of two portions:
+
+* A `Flatbuffers`_ header message
+* A series of bytes consisting of the flattened and packed body buffers (some
+  message types, like Schema messages, do not have this section)
+  - This is referred to as the *message body* in the IPC format spec.
+
+For most cases, the existing IPC format as it currently exists is extremely efficient:
+
+* Receiving data in the IPC format allows zero-copy utilization of the body
+  buffer bytes, no deserialization is required to form Arrow Arrays
+* An IPC (Feather) file can be memory-mapped because it is location agnostic
+  and the bytes of the file are exactly what is expected in memory.
+
+However, there are use cases that aren't handled by this:
+
+* Constructing the IPC record batch message requires allocating a contiguous
+  chunk of bytes and copying all of the data buffers into it, packed together
+  back-to-back. It's exceedingly difficult to zero-copy **create** IPC messages.
+* If the Arrow data is located in a shared-memory location, there is no standard
+  way to share the handle to the shared-memory across processes or transports that
+  allow for remote memory accessing.
+* Arrow data located on a non-CPU device (such as a GPU) cannot be sent using
+  Arrow IPC without having to copy the data back to the host device or copying
+  the flatbuffer metadata bytes into device memory.
+  - By the same token, receiving IPC messages into device memory would require
+    performing a copy of the flatbuffer metadata back to the host CPU device. This
+    is due to the fact that the IPC stream interleaves data and metadata across a
+    single stream.
+
+This protocol is intended to attempt to solve these use cases in an efficient manner.
+
+Goals
+-----
+
+* Define a generic protocol for passing Arrow IPC data, not tied to any particular
+  transport, that also allows for utilizing non-CPU device memory, shared memory, and
+  newer "high performance" transports such as `ucx`_ or `libfabric`_.
+* Allow for using :ref:`Flight RPC <flight-rpc>` purely for control flow by separating
+  the stream of IPC metadata from IPC body bytes
+  - This allows for the data in the body to be kept on non-CPU devices (like GPUs)
+    without expensive Device -> Host copies.
+
+Definitions
+-----------
+
+.. glossary::
+
+   IPC Metadata
+     The flatbuffer message bytes that encompass the header of an Arrow IPC message
+
+   Tag
+     A ``uint64`` value used as an ID for a message. Specific bits can be masked to
+     allow identifying messages by only a portion of the tag, leaving the rest of the
+     bits to be used for control flow or other message metadata.
+
+   Sequence Number
+     A little-endian, 4-byte unsigned integer starting at 0 for a stream, indicating
+     the sequence order of messages. It is also used to identify specific messages to
+     tie the IPC metadata header to its corresponding body since the metadata and body
+     can be sent across separate pipes/streams/transports.
+
+     If a sequence number reaches ``UINT32_MAX``, it should be allowed to roll over as
+     it is unlikely there would be enough unprocessed messages waiting to be processed
+     that would cause an overlap of sequence numbers.
+
+     The sequence number serves two purposes: To identify corresponding metadata and
+     tagged body data messages and to ensure we do not rely on messages having to arrive
+     in order. A client should use the sequence number to correctly order messages as
+     they arrive for processing.
+
+   Backpressure
+     *Currently* this proposal does not specify any way to manage the backpressure of
+     messages to throttle for memory and bandwidth reasons. For now, this will be
+     **transport-defined** rather than lock into something sub-optimal.
+
+     As usage among different transports and libraries grows, common patterns will emerge
+     that will allow for a generic, but efficient, way to handle backpressure across
+     different use cases.
+
+     .. note::
+       While the protocol itself is transport agnostic, the current usage and examples
+       only have been tested using UCX and libfabric transports so far, but that's all.
+
+
+Protocol Description
+====================
+
+A reference example implementation utilizing `libcudf`_ and `ucx`_ can be found at
+https://github.com/zeroshade/cudf-flight-ucx.
+
+Requirements
+------------
+
+A transport implementing this protocol **MUST** provide two pieces of functionality:
+
+* Message sending
+  - Delimited messages (like gRPC) as opposed to non-delimited streams (like plain TCP
+    without further framing).
+  - Alternatively, a framing mechanism like the `encapsulated message format <ipc-message-format>`
+    for the IPC protocol can be used while leaving out the body bytes.
+* Tagged message sending
+  - Sending a message that has an attached little-endian, unsigned 64-bit integral tag
+    for control flow. A tag like this allows control flow to operate on a message whose body
+    is on a non-CPU device without requiring the message itself to get copied off of the device.
+
+URI Specification
+-----------------
+
+When providing a URI to a consumer to contact for use with this protocol (such as via
+the `Location URI for Flight <flight-location-uris>`_), the URI should specify a scheme
+like *ucx://* or *fabric://*, that is easily identifiable. In addition, the URI should
+encode the following URI query parameters:
+
+.. note::
+   As this protocol matures, this document will get updated with commonly recognized
+   transport schemes that get used with it.
+
+* ``want_data`` - **REQUIRED** - uint64 integer value
+  - This value should be used to tag an initial message to the server to initiate a
+    data transfer. The body of the initiating message should be an opaque binary identifier
+    of the data stream being requested (like the ``Ticket`` in the Flight RPC protocol)
+* ``free_data`` - **OPTIONAL** - uint64 integer value
+  - If the server might send messages using offsets / addresses for remote memory accessing
+    or shared memory locations, the URI should include this parameter. This value is used to
+    tag messages sent from the client to the data server, containing specific offsets / addresses
+    which were provided that are no longer required by the client (i.e. any operations that
+    directly reference those memory locations, such as copying the remote data into local memory,
+    have been completed).
+* ``remote_handle`` - **OPTIONAL** - base64-encoded string
+  - When working with shared memory or remote memory, this value indicates any required
+    handle or identifier that is necessary for accessing the memory.
+    + Using UCX, this would be an *rkey* value
+    + With CUDA IPC, this would be the value of the base GPU pointer or memory handle,
+      and subsequent addresses would be offsets from this base pointer.

Review Comment:
If you are using absolute pointers into GPU space, then you don't need to pass the `remote_handle` parameter at all.

If you are using a CUDA IPC mem handle to make memory accessible in another process (or, more likely, you received a CUDA IPC mem handle), then you need to pass that handle across to identify the region of device memory that was exported for use by an external process. The IPC mem handle is obtained from the GPU base pointer of a particular allocation (see https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g8a37f7dfafaca652391d0758b3667539).

Essentially, we're supporting both cases in this protocol: absolute pointers (no need for `remote_handle`) and an IPC mem handle (the GPU base pointer, with subsequent addresses given as offsets from it).
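To make the two cases concrete, here is a minimal Python sketch of how a client might read the query parameters and interpret the addresses it receives. Everything named here (`parse_location`, `resolve_address`, `example_uri`, the host and port, the tag values) is hypothetical and not part of the PR. The integer `base_pointer` stands in for whatever device pointer the client gets after opening the handle (with CUDA IPC that would come from `cudaIpcOpenMemHandle`); no GPU calls are made.

```python
import base64
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def parse_location(uri: str) -> dict:
    """Pull want_data, free_data and remote_handle out of a location URI."""
    query = parse_qs(urlsplit(uri).query)
    if "want_data" not in query:
        raise ValueError("want_data is a REQUIRED query parameter")
    free_data = query.get("free_data")
    handle = query.get("remote_handle")
    return {
        "want_data": int(query["want_data"][0]),
        "free_data": int(free_data[0]) if free_data else None,
        # remote_handle is base64-encoded; if absent, addresses are absolute.
        "remote_handle": base64.b64decode(handle[0]) if handle else None,
    }


def resolve_address(value: int, base_pointer: Optional[int]) -> int:
    """Interpret an address received from the server.

    With no remote_handle the value is already an absolute pointer.
    With a handle, the value is an offset from the opened base pointer.
    """
    if base_pointer is None:
        return value
    return base_pointer + value


def example_uri(handle: Optional[bytes]) -> str:
    """Build a hypothetical ucx:// URI; urlencode percent-escapes the base64 text."""
    params = {"want_data": 1, "free_data": 2}
    if handle is not None:
        params["remote_handle"] = base64.b64encode(handle).decode("ascii")
    return "ucx://example-host:13337?" + urlencode(params)
```

Percent-escaping matters here because standard base64 can contain `+`, which query-string parsers otherwise decode as a space.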
