[jira] [Commented] (ARROW-4814) [Python] Exception when writing nested columns that are tuples to parquet

Joris Van den Bossche (JIRA) Wed, 08 May 2019 08:12:11 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835671#comment-16835671
 ]


Joris Van den Bossche commented on ARROW-4814:
----------------------------------------------

This is actually a different issue (not related to Parquet). The failure 
already happens on the conversion to an arrow Table, because pyarrow does not 
yet support an array of tuples out of the box:

{code}
In [75]: df = pd.DataFrame({'tuples': pd.Series([('A'), ('B',)], dtype=object)})

In [76]: pa.Table.from_pandas(df)
...
ArrowTypeError: ("Expected a bytes object, got a 'tuple' object", 'Conversion failed for column tuples with type object')
{code}

However, if you specify a schema indicating a ListType of strings, it works:

{code}
In [77]: schema = pa.schema([('tuples', pa.list_(pa.string()))])

In [78]: pa.Table.from_pandas(df, schema=schema)
Out[78]: 
pyarrow.Table
tuples: list<item: string>
  child 0, item: string
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "tuples'
            b'", "field_name": "tuples", "pandas_type": "list[unicode]", "nump'
            b'y_type": "object", "metadata": null}], "creator": {"library": "p'
            b'yarrow", "version": "0.13.1.dev130+gdd335952"}, "pandas_version"'
            b': "0.24.2"}'}
{code}

and such a table also writes to Parquet fine.

So the actual issue is that pyarrow does not support inferring a type for 
tuples (I removed the parquet label).
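
Until tuples are inferred, another possible workaround (not tested against the reporter's data) is to turn tuple cells into lists before the conversion, since lists of strings are inferred without a schema. A minimal sketch in plain Python; the helper name {{tuples_to_lists}} is hypothetical and not part of pyarrow:

```python
def tuples_to_lists(values):
    """Return a new list in which every tuple cell is replaced by a list.

    Cells that are not tuples (e.g. None) are passed through unchanged.
    """
    return [list(v) if isinstance(v, tuple) else v for v in values]


cells = [('A',), ('B', 'C'), None]
converted = tuples_to_lists(cells)
```

The result could then be passed to {{pa.array(...)}}. The DataFrame itself can keep its tuples if the conversion is done only at write time.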

> [Python] Exception when writing nested columns that are tuples to parquet
> -------------------------------------------------------------------------
>
>                 Key: ARROW-4814
>                 URL: https://issues.apache.org/jira/browse/ARROW-4814
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.1
>         Environment: 4.20.8-100.fc28.x86_64
>            Reporter: Suvayu Ali
>            Priority: Major
>              Labels: pandas
>         Attachments: df_to_parquet_fail.py, test.csv
>
>
> I get an exception when I try to write a {{pandas.DataFrame}} to a parquet 
> file where one of the columns contains tuples.  I use tuples here because 
> they allow for easier querying in pandas (see ARROW-3806 for a more detailed 
> description).
> {code}
> Traceback (most recent call last):
>   File "df_to_parquet_fail.py", line 5, in <module>
>     df.to_parquet("test.parquet")  # crashes
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2203, in to_parquet
>     partition_cols=partition_cols, **kwargs)
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 252, in to_parquet
>     partition_cols=partition_cols, **kwargs)
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 113, in write
>     table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
>   File "pyarrow/table.pxi", line 1141, in pyarrow.lib.Table.from_pandas
>   File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 431, in dataframe_to_arrays
>     convert_types)]
>   File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 430, in <listcomp>
>     for c, t in zip(columns_to_convert,
>   File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 426, in convert_column
>     raise e
>   File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 420, in convert_column
>     return pa.array(col, type=ty, from_pandas=True, safe=safe)
>   File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ("Could not convert ('G',) with type tuple: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column ALTS with type object')
> {code}
> The issue may be replicated with the attached script and csv file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
