[GitHub] [arrow] jorisvandenbossche commented on issue #34886: `np.asarray(parrow_table)` returns a transposed representation of the data

via GitHub Fri, 07 Apr 2023 12:45:42 -0700


jorisvandenbossche commented on issue #34886:
URL: https://github.com/apache/arrow/issues/34886#issuecomment-1500588323


   I would still call it a bug (if it works, i.e. it returns something, it 
shouldn't transpose the data), but I think it is indeed caused by the fact that 
we only implemented numpy compatibility at the array level, as Dane mentioned. 
   
   When doing `np.asarray(..)` on a pyarrow Table, numpy sees an object that 
has none of the protocol methods like `__array__`, but it does see an iterable 
object with `__getitem__`, and so it will try to convert it to an array like 
any other list-like. Converting the table to a list illustrates this:
   
   ```
   In [2]: table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})
   
   In [3]: list(table)
   Out[3]: 
   [<pyarrow.lib.ChunkedArray object at 0x7fb21b832e30>
    [
      [
        1,
        2,
        3
      ]
    ],
    <pyarrow.lib.ChunkedArray object at 0x7fb21b8328e0>
    [
      [
        4,
        5,
        6
      ]
    ]]
   ```
   
   So here we get a list of the columns, each being a ChunkedArray. But because 
those arrays do have numpy compatibility through `__array__`, numpy unpacks 
them further: instead of creating a 1D array of column objects, it creates a 
2D array, with the number of columns (how the table got unpacked initially) as 
the first dimension. That is what produces the "transposed" result compared to 
what you would expect.
   
   Leaving this as is doesn't sound like a good idea, given the unexpected 
shape. Two options I can think of:
   
   * Explicitly disallow conversion to numpy (I suppose we could raise an error 
in `__array__`, although we would have to check whether numpy still falls back 
to the current behaviour then), and leave it to users to do the conversion 
themselves (or go through another library that does this).
   * Actually implement `Table.__array__`. 
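   
   For the first option, the open question of whether numpy falls back to 
sequence unpacking can be checked with a small experiment. This is a sketch 
with a hypothetical stand-in class (`NoNumpyTable` and `conversion_is_blocked` 
are made up for illustration, not pyarrow code); it reports whether a 
`TypeError` raised inside `__array__` reaches the caller, and it would need to 
be run against every numpy version Arrow supports.

```python
import numpy as np


class NoNumpyTable:
    """Hypothetical stand-in for a Table that refuses numpy conversion."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __len__(self):
        return len(self._columns)

    def __getitem__(self, i):
        return self._columns[i]

    def __array__(self, dtype=None, copy=None):
        raise TypeError("conversion of a table to a numpy array is not supported")


def conversion_is_blocked(obj):
    """Return True if np.asarray(obj) raises instead of unpacking the sequence."""
    try:
        np.asarray(obj)
    except TypeError:
        return True
    return False


blocked = conversion_is_blocked(NoNumpyTable([[1, 2, 3], [4, 5, 6]]))
print(blocked)
```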
   
   A simple implementation (for us or for external users) could be 
`np.stack([np.asarray(col) for col in table], axis=1)`:
   
   ```
   In [14]: np.stack([np.asarray(col) for col in table], axis=1)
   Out[14]: 
   array([[1, 4],
          [2, 5],
          [3, 6]])
   ```
   
   I don't know if that will start to fail with more complex cases, though. 
It does seem that if the dtypes are not compatible, `np.stack` gives you object 
dtype instead of raising an error.
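   
   One way to guard against a silent object-dtype result is to check the dtype 
after stacking. The helper below is only a sketch: `stack_columns` and its 
`allow_object` parameter are hypothetical names, and plain numpy arrays stand 
in for the table columns.

```python
import numpy as np


def stack_columns(columns, allow_object=False):
    """Stack 1D columns into an array of shape (n_rows, n_columns).

    Hypothetical helper; `allow_object` is an illustrative parameter.
    """
    arrays = [np.asarray(col) for col in columns]
    out = np.stack(arrays, axis=1)  # columns become the second dimension
    if out.dtype == object and not allow_object:
        raise TypeError("columns have no common non-object dtype")
    return out


ints = np.array([1, 2, 3])
floats = np.array([4.5, 5.5, 6.5])
mixed = stack_columns([ints, floats])
print(mixed.shape, mixed.dtype)  # (3, 2) float64
```

   Integer and floating-point columns are promoted to a common float dtype 
here; a column that is already object dtype makes the helper raise unless 
`allow_object=True` is passed.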
   


