[
https://issues.apache.org/jira/browse/ARROW-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031743#comment-17031743
]
Ludwik Bielczynski edited comment on ARROW-7782 at 2/6/20 4:45 PM:
-------------------------------------------------------------------
Thanks, Joris, for checking this out. Yes, for single-file data the index is
preserved. However, I am sure you are aware that the usual use case for
parquet databases is not that simple. I think that preserving the index in one
case and resetting it in another can lead to confusion.
Please let me know when you have more information about the feasibility of
fixing this issue.
> Losing index information when using write_to_dataset with partition_cols
> ------------------------------------------------------------------------
>
> Key: ARROW-7782
> URL: https://issues.apache.org/jira/browse/ARROW-7782
> Project: Apache Arrow
> Issue Type: Bug
> Environment: pyarrow==0.15.1
> Reporter: Ludwik Bielczynski
> Priority: Major
>
> The index is not saved when using {{pyarrow.parquet.write_to_dataset()}}
> with the {{partition_cols}} argument. Here is a minimal example that
> shows the issue:
> {code:python}
> from pathlib import Path
>
> import pandas as pd
> from pyarrow import Table
> from pyarrow.parquet import read_table, write_to_dataset
>
> path = Path('/home/user/trials')
> file_name = 'local_database.parquet'
>
> # DataFrame with a named, non-default index
> df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
>                   index=pd.Index(['a', 'b', 'c'], name='idx'))
> table = Table.from_pandas(df)
>
> # Write the table as a dataset partitioned on column B
> write_to_dataset(table, str(path / file_name), partition_cols=['B'])
>
> # Read the dataset back; the 'idx' index is missing from the result
> df_read = read_table(str(path / file_name))
> df_read.to_pandas()
> {code}
>
> The issue is rather important for pandas and dask users.