[jira] [Updated] (ARROW-4350) [Python] nested numpy arrays

Joris Van den Bossche (JIRA) Fri, 07 Jun 2019 02:21:59 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche updated ARROW-4350:
-----------------------------------------
    Description: 
Nested numpy arrays cannot be converted to a list-of-list type array:

{code:python}
arr = np.empty(2, dtype=object)
arr[:] = [np.array([1, 2]), np.array([2, 3])]

pa.array([arr, arr])
{code}

results in

{code}
ArrowTypeError: only size-1 arrays can be converted to Python scalars
{code}

Starting from lists of lists works fine:

{code:python}
lists = [[1, 2], [2, 3]]
pa.array([lists, lists]).type
{code}

{code:none}
ListType(list<item: list<item: int64>>)
{code}

Specifying the type explicitly as {{pa.array([arr, arr], type=pa.list_(pa.list_(pa.int64())))}} does not help.

As a result, a round trip does not work, because converting a list-of-list 
array back to Python yields an array of arrays:

{code:python}
In [2]: lists = [[1, 2], [2, 3]]
   ...: a = pa.array([lists, lists])

In [3]: a.to_pandas()
Out[3]:
array([array([array([1, 2]), array([2, 3])], dtype=object),
       array([array([1, 2]), array([2, 3])], dtype=object)], dtype=object)

In [4]: pa.array(a.to_pandas())
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-4-9fee6dc9d0b8> in <module>
----> 1 pa.array(a.to_pandas())

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: only size-1 arrays can be converted to Python scalars
{code}




----
Original report:

{code:java}
In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]})

In [20]: df.iloc[0].to_dict()
Out[20]: {'a': [[1], [2]], 'b': 1}

In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()
Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1}

In [24]: np.array(df.iloc[0].to_dict()['a']).shape
Out[24]: (2, 1)

In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape
Out[25]: (2,)
{code}
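The extra level of nesting does not survive the round trip: the nested lists come back as an object array of arrays with shape {{(2,)}} rather than a 2-D structure with shape {{(2, 1)}}.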
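
The shape difference in the session above comes from how numpy treats the two containers. As a small numpy-only illustration (the variable names are made up for this sketch, and pyarrow is not involved), nested lists become a 2-D array, whereas an object array that holds arrays stays 1-D:

```python
import numpy as np

# What the DataFrame cell held originally: plain nested lists.
as_lists = [[1], [2]]

# What comes back after the round trip: an object array holding arrays.
as_arrays = np.empty(2, dtype=object)
as_arrays[:] = [np.array([1]), np.array([2])]

shape_from_lists = np.array(as_lists).shape  # (2, 1)
shape_from_arrays = as_arrays.shape          # (2,)

# When the inner arrays share a length, stacking them recovers the 2-D layout.
stacked = np.stack(list(as_arrays))
```

Stacking only works when all inner arrays have the same length, so it is not a general replacement for a proper round trip.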

 

More importantly, the following round trip fails:

 
{code:java}
In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': [[1, 2],[2, 3]]})

In [109]: df
Out[109]:
                  a       b
0  [[1, 2], [2, 3]]  [1, 2]
1  [[1, 2], [2, 3]]  [2, 3]

In [110]: pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
<ipython-input-110-4a09836f807e> in <module>()
----> 1 pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
1215 <pyarrow.lib.Table object at 0x7f05d1fb1b40>
1216 """
-> 1217 names, arrays, metadata = pdcompat.dataframe_to_arrays(
1218 df,
1219 schema=schema,

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
379 arrays = [convert_column(c, t)
380 for c, t in zip(columns_to_convert,
--> 381 convert_types)]
382 else:
383 from concurrent import futures

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in convert_column(col, ty)
374 e.args += ("Conversion failed for column {0!s} with type {1!s}"
375 .format(col.name, col.dtype),)
--> 376 raise e
377
378 if nthreads == 1:

ArrowTypeError: ('only size-1 arrays can be converted to Python scalars', 'Conversion failed for column a with type object')

{code}
 



> [Python] nested numpy arrays
> ----------------------------
>
>                 Key: ARROW-4350
>                 URL: https://issues.apache.org/jira/browse/ARROW-4350
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1, 0.12.0
>            Reporter: yu peng
>            Priority: Major
>             Fix For: 0.14.0
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
