khwilson commented on code in PR #44120:
URL: https://github.com/apache/arrow/pull/44120#discussion_r1762179315
##########
docs/source/python/extending_types.rst:
##########
@@ -116,73 +116,103 @@ a :class:`~pyarrow.Array` or a
:class:`~pyarrow.ChunkedArray`.
Defining extension types ("user-defined types")
-----------------------------------------------
-Arrow has the notion of extension types in the metadata specification as a
-possibility to extend the built-in types. This is done by annotating any of the
-built-in Arrow data types (the "storage type") with a custom type name and
-optional serialized representation ("ARROW:extension:name" and
-"ARROW:extension:metadata" keys in the Field’s custom_metadata of an IPC
-message).
-See the :ref:`format_metadata_extension_types` section of the metadata
-specification for more details.
-
-Pyarrow allows you to define such extension types from Python by subclassing
-:class:`ExtensionType` and giving the derived class its own extension name
-and serialization mechanism. The extension name and serialized metadata
-can potentially be recognized by other (non-Python) Arrow implementations
+Arrow affords a notion of extension types which allow users to annotate data
+types with additional semantics. This allows developers both to
+specify custom serialization and deserialization routines (for example,
+to :ref:`Python scalars <custom-scalar-conversion>` and
+:ref:`pandas <conversion-to-pandas>`) and to more easily interpret data.
+
+In Arrow, :ref:`extension types <format_metadata_extension_types>`
+are specified by annotating any of the built-in Arrow data types
+(the "storage type") with a custom type name and, optionally, a
+bytestring that can be used to provide additional metadata (referred to as
+"parameters" in this documentation). These appear as the
+``ARROW:extension:name`` and ``ARROW:extension:metadata`` keys in the
+Field's ``custom_metadata``.
+
+Note that since these annotations are part of the Arrow specification,
+they can potentially be recognized by other (non-Python) Arrow consumers
such as PySpark.
-For example, we could define a custom UUID type for 128-bit numbers which can
-be represented as ``FixedSizeBinary`` type with 16 bytes::
-
- class UuidType(pa.ExtensionType):
-
- def __init__(self):
- super().__init__(pa.binary(16), "my_package.uuid")
-
- def __arrow_ext_serialize__(self):
- # Since we don't have a parameterized type, we don't need extra
- # metadata to be deserialized
- return b''
+PyArrow allows you to define extension types from Python by subclassing
+:class:`ExtensionType` and giving the derived class its own extension name
+and mechanism to (de)serialize any parameters. For example, we could define
+a custom rational type for fractions which can be represented as a pair of
+integers::
+
+ class RationalType(pa.ExtensionType):
+
+ def __init__(self, data_type: pa.DataType):
+ if not pa.types.is_integer(data_type):
+ raise TypeError(f"data_type must be an integer type not {data_type}")
+
+ super().__init__(
+ pa.struct(
+ [
+ ("numer", data_type),
+ ("denom", data_type),
+ ],
+ ),
+ "my_package.rational",
+ )
+
+ def __arrow_ext_serialize__(self) -> bytes:
+ # No parameters are necessary
+ return b""
@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
# Sanity checks, not required but illustrate the method signature.
- assert storage_type == pa.binary(16)
- assert serialized == b''
- # Return an instance of this subclass given the serialized
- # metadata.
- return UuidType()
+ assert pa.types.is_struct(storage_type)
+ assert pa.types.is_integer(storage_type[0].type)
+ assert storage_type[0].type == storage_type[1].type
+ assert serialized == b""
+
+ # return an instance of this subclass
+ return RationalType(storage_type[0].type)
+
The special methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``
-define the serialization of an extension type instance. For non-parametric
-types such as the above, the serialization payload can be left empty.
+define the serialization and deserialization of an extension type instance.
This can now be used to create arrays and tables holding the extension type::
- >>> uuid_type = UuidType()
- >>> uuid_type.extension_name
- 'my_package.uuid'
- >>> uuid_type.storage_type
- FixedSizeBinaryType(fixed_size_binary[16])
-
- >>> import uuid
- >>> storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
- >>> arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)
+ >>> rational_type = RationalType(pa.int32())
+ >>> rational_type.extension_name
+ 'my_package.rational'
+ >>> rational_type.storage_type
+ StructType(struct<numer: int32, denom: int32>)
+
+ >>> storage_array = pa.array(
+ ... [
+ ... {"numer": 10, "denom": 17},
+ ... {"numer": 20, "denom": 13},
+ ... ],
+ ... type=rational_type.storage_type,
+ ... )
+ >>> arr = rational_type.wrap_array(storage_array)
+ >>> # or equivalently
+ >>> arr = pa.ExtensionArray.from_storage(rational_type, storage_array)
>>> arr
- <pyarrow.lib.ExtensionArray object at 0x7f75c2f300a0>
+ <pyarrow.lib.ExtensionArray object at 0x1067f5420>
+ -- is_valid: all not null
+ -- child 0 type: int32
+ [
+ 10,
+ 20
+ ]
Review Comment:
This is actually a problem in several places in the documentation. It seems
a lot of people's formatters assume a 4-space indent, whereas pyarrow always
prints array contents with a 2-space indent.
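
If it helps, here is a rough sketch of how the affected output blocks could be normalized. `reindent` is a hypothetical helper written for this comment, not something that exists in pyarrow or in the docs tooling, and it assumes the only problem is that each nesting level was widened from 2 to 4 spaces:

```python
def reindent(block: str, from_width: int = 4, to_width: int = 2) -> str:
    """Rescale leading indentation from `from_width`-space steps to
    `to_width`-space steps (hypothetical helper, for illustration only)."""
    lines = []
    for line in block.splitlines():
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        # Whole nesting levels are rescaled; leftover spaces are kept as-is.
        levels, remainder = divmod(indent, from_width)
        lines.append(" " * (levels * to_width + remainder) + stripped)
    return "\n".join(lines)


# Output as a 4-space formatter might have left it in the docs.
doc_output = "[\n    10,\n    20\n]"
fixed = reindent(doc_output)
```

The safer route is probably still to paste the output straight from a real interpreter session, since that cannot drift from what pyarrow prints.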