Joris Van den Bossche commented on ARROW-5610:

OK, I am making some progress on this (I was initially disregarding the 
parametrized type case, so we indeed need C++ <-> Python interaction). I have 
basic roundtripping with a parametrized type working. E.g. in Python an 
implementor can do:
import pyarrow as pa

class PeriodType(pa.GenericExtensionType):

    def __init__(self, freq):
        # attributes need to be set first before calling super init
        # (as that calls serialize)
        self.freq = freq
        pa.lib.GenericExtensionType.__init__(self, pa.int64(), 'pandas.period')

    def __arrow_ext_serialize__(self):
        # the parametrization is stored in the serialized metadata
        return "freq={}".format(self.freq).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        serialized = serialized.decode()
        assert serialized.startswith("freq=")
        freq = serialized.split('=')[1]
        return PeriodType(freq)

period_type = PeriodType('D')

and that can roundtrip IPC with the "pandas.period" extension name (so not a 
generic "arrow.py_extension").

I based the above interface (the {{__arrow_ext_serialize__}} and 
{{__arrow_ext_deserialize__}} methods to implement) on the existing 
{{PyExtensionType}} that Antoine implemented.
> I assume the generic ExtensionType would have a Python "vtable" for 
> Python subclasses to implement the C++ methods

So currently I based myself on the existing {{PyExtensionType}} and copied the 
approach there of storing a weakref to an instance and the class of the Python 
subclass the user defines. 
That seems to work, but I am not familiar enough with this to judge whether the 
vtable approach (as used in PyFlightServer) would be better.

> The registration method would need to support parameterized types as 
> well (i.e. registering multiple instances of the same type with different 

Is that needed? My current idea is that you would register a certain type once 
(with _some_ parametrization, so you don't have to register each possible 
parametrization). We register in C++ based on the name, so otherwise 
the name would need to include the parameter. Actually, writing this down now, 
that could also be an option (currently I use the serialized metadata for 
storing the parametrization).

Other questions I still need to answer:
 - What to do with registration and unregistration? It would be nice if a user 
didn't need to register a type manually (in Python that could be done with a 
metaclass to register the subclass on definition, but I am not sure that is 
possible in Cython).
 Also for unregistering, since that is needed to avoid segfaults on shutdown, 
we probably need to keep a Python-side registry of the C++-registered types to 
ensure we unregister them on shutdown.
 - Do we want to keep the current {{PyExtensionType}} based on pickle? I think 
the main advantage compared to the new implementation is that when reading an 
IPC message, the type does not need to be registered to be recognized (for the 
unpickling, it is enough that the module is importable; it does not need to be 
imported manually by the user). But on the other hand it gives two largely 
overlapping alternatives.

I will try to clean up and push to a draft PR, which will be easier to get an 


> [Python] Define extension type API in Python to "receive" or "send" a foreign 
> extension type
> --------------------------------------------------------------------------------------------
>                 Key: ARROW-5610
>                 URL: https://issues.apache.org/jira/browse/ARROW-5610
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
> In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. 
> There will be cases where an extension type is coming from another 
> programming language (e.g. Java), so it would be useful to be able to "plug 
> in" a Python extension type subclass that will be used to deserialize the 
> extension type coming over the wire. This has some different API requirements 
> since the serialized representation of the type will not have knowledge of 
> Python pickling, etc. 
