[
https://issues.apache.org/jira/browse/ARROW-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031729#comment-17031729
]
Joris Van den Bossche commented on ARROW-7782:
----------------------------------------------
Ah, OK. So the index information is preserved for single files (with your
dataframe above):
{code}
In [4]: df
Out[4]:
     A  B
idx
a    1  a
b    2  a
c    3  b

In [5]: df.to_parquet("test_index.parquet")

In [6]: pd.read_parquet("test_index.parquet")
Out[6]:
     A  B
idx
a    1  a
b    2  a
c    3  b
{code}
but for partitioned data this is more difficult.
The problem is the current implementation of {{write_to_dataset}}, which splits
the pandas dataframe into parts using pandas' groupby and then writes each of
those parts to parquet. In this process, the index information is lost.
It needs some more investigation to determine whether it is easy, or even
possible, to preserve that information here.
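As a rough illustration of the splitting step described above, and of one possible workaround, here is a pandas-only sketch. It is not the actual {{write_to_dataset}} code: {{split_by_partition}} is a hypothetical helper, and dropping the index with {{reset_index(drop=True)}} only stands in for whatever loses it in the real writer. The workaround assumes it is acceptable to store the index as an ordinary column and restore it after reading.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["a", "a", "b"]},
    index=pd.Index(["a", "b", "c"], name="idx"),
)


def split_by_partition(frame, partition_col):
    """Hypothetical helper: split a frame into one part per partition value."""
    parts = {}
    for key, part in frame.groupby(partition_col):
        # The partition column is encoded in the directory name, not the file.
        parts[key] = part.drop(columns=[partition_col])
    return parts


# Simulated loss: each part is stored without its index labels.
lost = pd.concat(
    part.reset_index(drop=True) for part in split_by_partition(df, "B").values()
)

# Workaround sketch: turn the index into a regular column before splitting,
# then restore it after the parts are read back and recombined.
parts = split_by_partition(df.reset_index(), "B")
restored = pd.concat(
    part.assign(B=key) for key, part in parts.items()
).set_index("idx")
```

In the simulated-loss case the {{idx}} labels are gone from both the index and the columns, while in the workaround they travel with the data as a normal column and can be set as the index again.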
> Losing index information when using write_to_dataset with partition_cols
> ------------------------------------------------------------------------
>
> Key: ARROW-7782
> URL: https://issues.apache.org/jira/browse/ARROW-7782
> Project: Apache Arrow
> Issue Type: Bug
> Environment: pyarrow==0.15.1
> Reporter: Ludwik Bielczynski
> Priority: Major
>
> The index cannot be saved when using {{pyarrow.parquet.write_to_dataset()}}
> with the {{partition_cols}} argument. Here is a minimal example
> that shows the issue:
> {code:python}
>
> from pathlib import Path
> import pandas as pd
> from pyarrow import Table
> from pyarrow.parquet import write_to_dataset, read_table
> path = Path('/home/user/trials')
> file_name = 'local_database.parquet'
> df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
>                   index=pd.Index(['a', 'b', 'c'],
>                                  name='idx'))
> table = Table.from_pandas(df)
> write_to_dataset(table,
>                  str(path / file_name),
>                  partition_cols=['B']
>                  )
> df_read = read_table(str(path / file_name))
> df_read.to_pandas()
> {code}
>
> The issue is rather important for pandas and dask users.