pkrack commented on issue #46007: URL: https://github.com/apache/arrow/issues/46007#issuecomment-4213757087
I investigated this a bit further, and #46007, #45262, and #44853 seem to share the same underlying issue: recursive instantiation / conversion does not automatically unwrap to the storage type and then rewrap to the extension type.

Top-level construction with `pa.array` has a special case in `python/pyarrow/array.pxi` [at line 265](https://github.com/apache/arrow/blob/35fb62e6224617d5ae749533654e9f8c7a6250c7/python/pyarrow/array.pxi#L265). `ExtensionType`s nested in structured types do not go through that special case. Instead, they are handled in [`python/pyarrow/src/arrow/python/python_to_arrow.cc`](https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/python_to_arrow.cc), where `PyConverterTrait` is not implemented for `ExtensionType`s. Because of this, `MakeConverter` in [`cpp/src/arrow/util/converter.h`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/converter.h) falls back to the generic `Visit` implementation, which returns `Status::NotImplemented(t.name())` (cf. [`converter.h:L251`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/converter.h#L251)). That produces the observed `ArrowNotImplementedError: extension`.

Related issue: there is no builder for extension types, i.e. you cannot automatically create arrays with nested extension types through the Builder interface in C++ either. The `Converter`s in `python_to_arrow.cc` typically use such a builder internally (see also `converter.h`).

So basically what needs to be done here is to apply the "unwrap to storage type, then rewrap" trick that is already used in other parts of the code base. The question is where this should happen:

1. In a builder: the converters then use these builders and the top-level special case can be removed. I.e. new builder + converter implementation for extension types -> nested extension types are then also supported in C++.
2. In a converter, which then instantiates a builder for the storage type. I.e. new converter class -> nested extension types would only be supported in Python.
3. In the container types (list, map, etc.) -> requires changes to all container types. Perhaps some macro / template magic? This would mean some more duplication, but on the other hand it follows the idea that seems to be represented in the code base: consumers should work with the storage type.

Workaround for Python: construct the extension-typed child array explicitly first, then use the corresponding `from_arrays` / `from_array` constructor, for example: `pa.FixedSizeListArray.from_arrays(pa.array(['{"a": 1}'], type=pa.json_()), type=pa.list_(pa.json_(), 1))`. Top-level construction works; automatic recursive construction does not.
