[jira] [Commented] (ARROW-7782) Losing index information when using write_to_dataset with partition_cols

Joris Van den Bossche (Jira) Thu, 06 Feb 2020 08:26:36 -0800


    [ https://issues.apache.org/jira/browse/ARROW-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031729#comment-17031729 ]


Joris Van den Bossche commented on ARROW-7782:
----------------------------------------------

Ah, OK. So the index information is preserved for single files (with your 
dataframe above):

{code}
In [4]: df
Out[4]: 
     A  B
idx      
a    1  a
b    2  a
c    3  b

In [5]: df.to_parquet("test_index.parquet")

In [6]: pd.read_parquet("test_index.parquet")
Out[6]: 
     A  B
idx      
a    1  a
b    2  a
c    3  b
{code}

but for partitioned data this is more difficult. 
The problem is the current implementation of {{write_to_dataset}}, which splits 
the pandas dataframe into parts using pandas' groupby and then writes each part 
to parquet. In that process the index information is lost. It needs some more 
investigation to see whether it is easy, or even possible, to preserve that 
information here.

 


> Losing index information when using write_to_dataset with partition_cols
> ------------------------------------------------------------------------
>
>                 Key: ARROW-7782
>                 URL: https://issues.apache.org/jira/browse/ARROW-7782
>             Project: Apache Arrow
>          Issue Type: Bug
>         Environment: pyarrow==0.15.1
>            Reporter: Ludwik Bielczynski
>            Priority: Major
>
> One cannot save the index when using {{pyarrow.parquet.write_to_dataset()}} 
> with the {{partition_cols}} argument. Here is a minimal example that shows 
> the issue:
> {code:python}
> from pathlib import Path
> import pandas as pd
> from pyarrow import Table
> from pyarrow.parquet import write_to_dataset, read_table
> path = Path('/home/user/trials')
> file_name = 'local_database.parquet'
> df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']}, 
>                   index=pd.Index(['a', 'b', 'c'], 
>                   name='idx'))
> table = Table.from_pandas(df)
> write_to_dataset(table, 
>                  str(path / file_name), 
>                  partition_cols=['B']
>                 )
> df_read = read_table(str(path / file_name))
> df_read.to_pandas()
> {code}
>  
> The issue is rather important for pandas and dask users.



