[ 
https://issues.apache.org/jira/browse/ARROW-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miles Granger updated ARROW-18273:
----------------------------------
    Description: 
Currently, compute kernels don't recognize extension types, so if you define 
semantic types to indicate things like "this string column is an image 
label", you cannot then run something as basic as an equality comparison on it.

For example, take the LabelType from 
[https://github.com/apache/arrow/blob/c3824db8530075e0f8fd26974c193a310003c17a/python/pyarrow/tests/test_extension_type.py]

{code:python}
In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: class LabelType(pa.PyExtensionType):
   ...:
   ...:     def __init__(self):
   ...:         pa.PyExtensionType.__init__(self, pa.string())
   ...:
   ...:     def __reduce__(self):
   ...:         return LabelType, ()
   ...:

In [4]: tbl = pa.Table.from_arrays([pa.ExtensionArray.from_storage(LabelType(), 
pa.array(['cat', 'dog', 'person']))], names=['label'])

In [5]: tbl.filter(pc.field('label') == 'cat')
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
Cell In [5], line 1
----> 1 tbl.filter(pc.field('label') == 'cat')

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:2953, in 
pyarrow.lib.Table.filter()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:391, in 
pyarrow._exec_plan._filter_table()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:128, in 
pyarrow._exec_plan.execplan()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in 
pyarrow.lib.pyarrow_internal_check_status()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in 
pyarrow.lib.check_status()

ArrowNotImplementedError: Function 'equal' has no kernel matching input types 
(extension<arrow.py_extension_type<LabelType>>, string)
{code}

For query systems that push some of the compute down to Arrow (e.g., DuckDB), 
this also makes it much harder for users to work with datasets that contain 
extension types, because they cannot know which functions will actually work.

Instead, if we can make the compute kernels default to the storage type, it 
would make the extension system a lot easier to work with in Arrow.

  was:
Currently, compute kernels don't recognize extension types, so if you define 
semantic types to indicate things like "this string column is an image 
label", you cannot then run something as basic as an equality comparison on it.

For example, take the LabelType from 
https://github.com/apache/arrow/blob/c3824db8530075e0f8fd26974c193a310003c17a/python/pyarrow/tests/test_extension_type.py

```
In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: class LabelType(pa.PyExtensionType):
   ...:
   ...:     def __init__(self):
   ...:         pa.PyExtensionType.__init__(self, pa.string())
   ...:
   ...:     def __reduce__(self):
   ...:         return LabelType, ()
   ...:

In [4]: tbl = pa.Table.from_arrays([pa.ExtensionArray.from_storage(LabelType(), 
pa.array(['cat', 'dog', 'person']))], names=['label'])

In [5]: tbl.filter(pc.field('label') == 'cat')
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [5], line 1
----> 1 tbl.filter(pc.field('label') == 'cat')

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:2953, in 
pyarrow.lib.Table.filter()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:391, in 
pyarrow._exec_plan._filter_table()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:128, in 
pyarrow._exec_plan.execplan()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in 
pyarrow.lib.pyarrow_internal_check_status()

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in 
pyarrow.lib.check_status()

ArrowNotImplementedError: Function 'equal' has no kernel matching input types 
(extension<arrow.py_extension_type<LabelType>>, string)
```

For query systems that push some of the compute down to Arrow (e.g., DuckDB), 
this also makes it much harder for users to work with datasets that contain 
extension types, because they cannot know which functions will actually work.


Instead, if we can make the compute kernels default to the storage type, it 
would make the extension system a lot easier to work with in Arrow.



> [Python] For extension types, compute kernels should default to storage types?
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-18273
>                 URL: https://issues.apache.org/jira/browse/ARROW-18273
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 10.0.0
>            Reporter: Chang She
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
