Re: Aligning intended target types for lists and structs when converting to pandas DataFrame

Wes McKinney Thu, 10 Sep 2020 10:59:20 -0700

I think it would make more sense to use Arrow for nested types in Ibis.
I'm biased, having been heavily involved in both projects, but NumPy
doesn't have a very good story for nested data. If possible, it would
be better to prevent technical debt from accumulating from decisions
made years ago, since the Ibis project does not yet have a user base
as large as those of other Python data projects.


For what it's worth I have intended to help create an Ibis interface
to Arrow-native computing functionality once we are able to develop
more complete query processing functionality in the project.

On Thu, Sep 10, 2020 at 11:48 AM Tim Swast <[email protected]> wrote:
>
> Hello Arrow and Ibis devs,
>
> I notice that Arrow's to_pandas method produces different types than the
> Ibis test suite expects.
>
>    - Lists are returned as numpy arrays in Arrow, but expected to be Python
>      list objects in Ibis.
>    - NULL values in integer columns are converted to NaN in Arrow, but Ibis
>      expects None.
>
> There's an argument to be made that what Arrow is doing is the most correct,
> and certainly more performant than Python objects. I think it'd be helpful
> if the Pandas, Ibis, and Arrow communities aligned on what the intended
> Pandas types are for these complex values.
>
> If not, maybe there are some more generic test utilities that we can use in
> Ibis to accept numpy arrays in backend output?
>
> Or maybe Ibis should start adopting Arrow directly, at least for complex
> types? Maybe via Fletcher?
>
> *Background:*
>
> I very recently sent a PR to Ibis to mark several BigQuery tests as xfail
> (github.com/ibis-project/ibis/pull/2375). I believe they started failing
> when the google-cloud-bigquery library started using Arrow's to_pandas
> method (PR: github.com/googleapis/google-cloud-python/pull/10027) instead
> of a slower method that doesn't use Arrow.
>
> These test failures are due to to_pandas returning different types than the
> Ibis tests expect, such as numpy arrays in the case of lists (ibis#2370
> <https://github.com/ibis-project/ibis/issues/2370>, ibis#2372
> <https://github.com/ibis-project/ibis/issues/2372>, ibis#2374
> <https://github.com/ibis-project/ibis/issues/2374>), NaN values for NULL
> integers (ibis#2371 <https://github.com/ibis-project/ibis/issues/2371>),
> and an unimplemented conversion for structs containing lists (ibis#2373
> <https://github.com/ibis-project/ibis/issues/2373>).
>
> I'd like to figure out what the next steps should be. Options:
>
>    - Get BigQuery to output the currently expected Python objects in Ibis,
>    - Change Ibis to expect more Arrow-aligned types for complex types, or
>    - Update the Ibis tests to accept either Python objects or the output of
>      Arrow's to_pandas method.
>
> Thanks for your help,
>
> Tim Swast
> Senior Software Friendliness Engineer, Data & Analytics
> Google Cloud Developer Relations
> Chicago, IL, USA
