[ 
https://issues.apache.org/jira/browse/ARROW-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031743#comment-17031743
 ] 

Ludwik Bielczynski edited comment on ARROW-7782 at 2/6/20 4:45 PM:
-------------------------------------------------------------------

Thanks, Joris, for checking this out. Yes, for single-file data the index is 
preserved. However, as I am sure you are aware, typical Parquet datasets are 
rarely that simple. Preserving the index in one case and resetting it in 
another can easily cause confusion.

Please let me know when you have more information about the feasibility of 
fixing this issue.


> Losing index information when using write_to_dataset with partition_cols
> ------------------------------------------------------------------------
>
>                 Key: ARROW-7782
>                 URL: https://issues.apache.org/jira/browse/ARROW-7782
>             Project: Apache Arrow
>          Issue Type: Bug
>         Environment: pyarrow==0.15.1
>            Reporter: Ludwik Bielczynski
>            Priority: Major
>
> The index is lost when using {{pyarrow.parquet.write_to_dataset()}} with the 
> {{partition_cols}} argument. Here is a minimal example that shows the 
> issue:
> {code:python}
> from pathlib import Path
>
> import pandas as pd
> from pyarrow import Table
> from pyarrow.parquet import read_table, write_to_dataset
>
> path = Path('/home/user/trials')
> file_name = 'local_database.parquet'
>
> # DataFrame with a named, non-default index.
> df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
>                   index=pd.Index(['a', 'b', 'c'], name='idx'))
> table = Table.from_pandas(df)
>
> # Write a dataset partitioned on column B.
> write_to_dataset(table, str(path / file_name), partition_cols=['B'])
>
> # Read the dataset back; the 'idx' index is missing from the result.
> table_read = read_table(str(path / file_name))
> table_read.to_pandas()
> {code}
>  
> This issue matters to pandas and dask users, who rely on the index being 
> preserved when round-tripping data.


