timsaucer opened a new issue, #1301:
URL: https://github.com/apache/datafusion-python/issues/1301

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Suppose I have a pyarrow scalar value that contains an extension type. If I 
turn that into a literal expression in datafusion, the associated extension 
metadata should be carried along transparently to the user.
   
   Consider this minimal example:
   
   ```python
   import pyarrow as pa
   import uuid
   from datafusion import lit
   
   value = pa.scalar(uuid.uuid4().bytes, pa.uuid())
   
   print(lit(value))
   ```
   
   This currently fails with `ArrowTypeError: Expected bytes, got a 'UUID' 
object`. The error can be avoided with a simple patch that reads the scalar's 
storage `value` when it is available instead of calling `as_py()`:
   
   ```patch
   --- a/src/pyarrow_util.rs
   +++ b/src/pyarrow_util.rs
   @@ -30,7 +30,11 @@ impl FromPyArrow for PyScalarValue {
        fn from_pyarrow_bound(value: &Bound<'_, PyAny>) -> PyResult<Self> {
            let py = value.py();
            let typ = value.getattr("type")?;
   -        let val = value.call_method0("as_py")?;
   +        let val = if value.hasattr("value")? {
   +            value.getattr("value")?
   +        } else {
   +            value.call_method0("as_py")?
   +        };
   ```
   
   Even with this patch we still don't have the metadata. It is lost and we 
get a bare fixed-size binary.
   
   **Describe the solution you'd like**
   
   The above code should *just work*. I have done a little investigation, and 
using the Arrow PyCapsule interface we *can* get the schema of the array we 
generate inside `PyScalarValue::from_pyarrow_bound`. We can then plumb this 
through when calling `lit()`.
   
   Ideally we would take this opportunity to ensure that 
`PyScalarValue::from_pyarrow_bound` also supports libraries other than 
`pyarrow`. There have been several complaints that we are too tightly coupled 
to `pyarrow`. In particular, it would be good to demonstrate that converting a 
Python object that is a scalar value works for:
   
   - pyarrow
   - nanoarrow
   - arro3
   - polars
   
   I don't think we necessarily need to support pandas since it is not an 
Arrow library.
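
One library-agnostic way to recognise Arrow objects is to duck-type against the dunder methods defined by the Arrow PyCapsule Interface rather than importing `pyarrow`. The following is a minimal sketch of that check only. The helper name `arrow_capsule_protocols` and the `FakeArrowArray` stand-in are hypothetical, and whether the scalar types of each library listed above actually expose these methods has not been verified here.

```python
from typing import Any

# Dunder methods defined by the Arrow PyCapsule Interface.
ARROW_CAPSULE_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
)


def arrow_capsule_protocols(obj: Any) -> list[str]:
    """Return the PyCapsule protocol methods that `obj` implements."""
    return [
        name
        for name in ARROW_CAPSULE_METHODS
        if callable(getattr(obj, name, None))
    ]


class FakeArrowArray:
    """Hypothetical stand-in for a library object; it exports no real capsules."""

    def __arrow_c_schema__(self):
        raise NotImplementedError

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError


print(arrow_capsule_protocols(FakeArrowArray()))
print(arrow_capsule_protocols(object()))
```

A check like this only says which protocols an object advertises. The actual conversion, and reading the extension metadata out of the exported schema, would still happen on the Rust side.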
   
   **Describe alternatives you've considered**
   
   Alternatively, the user can manually convert their data to the underlying 
storage type and then attach the metadata from their extension type. This 
feels like a poor user experience.
   
   **Additional context**
   
   This came up during a different investigation:
   
   > Also worth evaluating while we're doing this: For scalar values, is it 
possible for them to contain metadata? If I do `pa.scalar(uuid.uuid4().bytes, 
type=pa.uuid())` and I check the `type` I should have the extension data. Maybe 
this is already supported, but as part of this PR I want to evaluate that as 
well.
   
   _Originally posted by @timsaucer in 
https://github.com/apache/datafusion-python/issues/1299#issuecomment-3497558869_
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

