[ https://issues.apache.org/jira/browse/ARROW-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633667#comment-17633667 ]
Miles Granger commented on ARROW-18273:
---------------------------------------
I think this makes good sense, although I'm not sure about the implementation
details. I think many (all?) kernels specify their allowed input types
before runtime, but perhaps there is a way to match based on storage type as well?
cc [~jorisvandenbossche]
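
To make the "match based on storage type" idea concrete, here is a minimal sketch in plain Python. It is a toy model only, not Arrow's actual kernel registry or dispatch code: `ExtType`, `KERNELS`, `storage_of`, and `dispatch` are all hypothetical names invented for illustration. The lookup first tries the declared input types exactly, and only if that fails retries with each extension type replaced by its storage type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtType:
    """Hypothetical stand-in for an extension type wrapping a storage type."""
    name: str
    storage: str


# Hypothetical registry: (function name, input types) -> kernel.
# Kernels declare their allowed input types up front, before any data is seen.
KERNELS = {
    ("equal", ("string", "string")): lambda values, scalar: [v == scalar for v in values],
}


def storage_of(t):
    """Return the storage type for an extension type, or the type itself."""
    return t.storage if isinstance(t, ExtType) else t


def dispatch(func, types):
    """Look up a kernel by exact types, then fall back to storage types."""
    exact = (func, tuple(types))
    if exact in KERNELS:
        return KERNELS[exact]
    fallback = (func, tuple(storage_of(t) for t in types))
    if fallback in KERNELS:
        return KERNELS[fallback]
    raise NotImplementedError(
        f"Function {func!r} has no kernel matching input types {tuple(types)}"
    )


label = ExtType("LabelType", "string")
kernel = dispatch("equal", (label, "string"))  # resolved via the storage type
result = kernel(["cat", "dog", "person"], "cat")
print(result)
```

In this sketch an exact match always wins, so a kernel registered specifically for an extension type would still take precedence over the storage-type fallback. Whether the output should be re-wrapped in the extension type (for functions that return values rather than booleans) is a separate design question that the sketch does not address.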
> [Python] For extension types, compute kernels should default to storage types?
> ------------------------------------------------------------------------------
>
> Key: ARROW-18273
> URL: https://issues.apache.org/jira/browse/ARROW-18273
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Affects Versions: 10.0.0
> Reporter: Chang She
> Priority: Major
>
> Currently, compute kernels don't recognize extension types, so if you
> define semantic types to indicate things like "this string column is
> an image label", you cannot then use functions like equal on it.
> For example, take the LabelType from
> [https://github.com/apache/arrow/blob/c3824db8530075e0f8fd26974c193a310003c17a/python/pyarrow/tests/test_extension_type.py]
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.compute as pc
> In [3]: class LabelType(pa.PyExtensionType):
>    ...:
>    ...:     def __init__(self):
>    ...:         pa.PyExtensionType.__init__(self, pa.string())
>    ...:
>    ...:     def __reduce__(self):
>    ...:         return LabelType, ()
>    ...:
> In [4]: tbl = pa.Table.from_arrays(
>    ...:     [pa.ExtensionArray.from_storage(LabelType(), pa.array(['cat', 'dog', 'person']))],
>    ...:     names=['label'])
> In [5]: tbl.filter(pc.field('label') == 'cat')
> ---------------------------------------------------------------------------
> ArrowNotImplementedError Traceback (most recent call last)
> Cell In [5], line 1
> ----> 1 tbl.filter(pc.field('label') == 'cat')
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:2953, in pyarrow.lib.Table.filter()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:391, in pyarrow._exec_plan._filter_table()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:128, in pyarrow._exec_plan.execplan()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
> ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.py_extension_type<LabelType>>, string)
> {code}
> For query systems that push some of the compute down to Arrow (e.g., DuckDB),
> it also means that it's much harder for users to work with datasets with
> extension types, because you don't know which functions will actually work.
> If the compute kernels instead defaulted to the storage type, the
> extension system would be a lot easier to work with in Arrow.