timsaucer opened a new issue, #1301:
URL: https://github.com/apache/datafusion-python/issues/1301
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
Suppose I have a pyarrow scalar value with an extension type. If I turn it
into a literal expression in datafusion, the associated extension metadata
should be carried through transparently, with no extra work from the user.
Consider this minimal example:
```python
import pyarrow as pa
import uuid
from datafusion import lit
value = pa.scalar(uuid.uuid4().bytes, pa.uuid())
print(lit(value))
```
This currently fails with `ArrowTypeError: Expected bytes, got a 'UUID'
object`. That can be overcome with the simple patch:
```patch
--- a/src/pyarrow_util.rs
+++ b/src/pyarrow_util.rs
@@ -30,7 +30,11 @@ impl FromPyArrow for PyScalarValue {
     fn from_pyarrow_bound(value: &Bound<'_, PyAny>) -> PyResult<Self> {
         let py = value.py();
         let typ = value.getattr("type")?;
-        let val = value.call_method0("as_py")?;
+        let val = if value.hasattr("value")? {
+            value.getattr("value")?
+        } else {
+            value.call_method0("as_py")?
+        };
```
But even then we still don't have the metadata: it is lost, and we are left
with a bare fixed-size binary.
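For illustration, the patched fallback can be sketched in pure Python (the real code is Rust in `src/pyarrow_util.rs`; the `Fake*` classes below are hypothetical stand-ins for pyarrow scalar objects):

```python
def extract_scalar_payload(scalar):
    """Mirror the patched Rust logic: prefer an extension scalar's
    storage `.value`, and fall back to `.as_py()` for plain scalars."""
    if hasattr(scalar, "value"):
        return scalar.value
    return scalar.as_py()


# Hypothetical stand-ins for pyarrow scalar objects:
class FakeExtensionScalar:
    value = b"\x00" * 16  # raw fixed-size-binary storage


class FakePlainScalar:
    def as_py(self):
        return 42
```

With these stand-ins, `extract_scalar_payload(FakeExtensionScalar())` yields the raw storage bytes while `extract_scalar_payload(FakePlainScalar())` yields `42`, which is exactly the branch the patch adds. Note it recovers the storage value but still drops the extension metadata.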
**Describe the solution you'd like**
The above code should *just work*. I have done a little investigation: using
the Arrow PyCapsule interface we *can* recover the schema of the array we
generate inside `PyScalarValue::from_pyarrow_bound`, and we can then plumb
that schema through when calling `lit()`.
Ideally we would take this opportunity to ensure that
`PyScalarValue::from_pyarrow_bound` also supports libraries other than
`pyarrow`. It has come up a few times that we are too tightly coupled to
`pyarrow`. In particular, it would be good to demonstrate that converting a
Python object representing a scalar value works for:
- pyarrow
- nanoarrow
- arro3
- polars
I don't think we necessarily need to support pandas, since it is not an
Arrow-native library.
**Describe alternatives you've considered**
Alternatively, the user can manually convert their data to the underlying
storage type and then re-attach the metadata from their extension type
themselves. This is a poor user experience.
**Additional context**
This came up during a different investigation:
> Also worth evaluating while we're doing this: For scalar values, is it
possible for them to contain metadata? If I do `pa.scalar(uuid.uuid4().bytes,
type=pa.uuid())` and I check the `type` I should have the extension data. Maybe
this is already supported, but as part of this PR I want to evaluate that as
well.
_Originally posted by @timsaucer in
https://github.com/apache/datafusion-python/issues/1299#issuecomment-3497558869_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]