[ https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446771#comment-17446771 ]

Vadim Mironov commented on ARROW-14772:
---------------------------------------

I see. Thanks, Weston. I confirm that either of the two approaches gives exactly 
what I would expect to see as the result of a groupby. The suggestion to use the 
pyarrow.dataset API directly is certainly the most appealing one, as I am quite 
hesitant to retrofit all the pandas groupby calls with observed=True.
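
For anyone who finds this later, here is a minimal sketch of the two workarounds 
as I understand them (it reuses the dataset_dir and start_date names from the 
repro script below; the exact calls are my approximation of Weston's suggestion, 
not a verbatim quote):

{code:python}
import pyarrow.dataset as ds

# Option 1: read through the pyarrow.dataset API directly. The partition
# column comes back as a plain column rather than a pandas Categorical,
# so no unobserved category values survive the filter.
dataset = ds.dataset(dataset_dir, format='parquet', partitioning='hive')
restored_df = dataset.to_table(
    filter=ds.field('date') == str(start_date)).to_pandas()

# Option 2: keep pd.read_parquet as-is and pass observed=True so groupby
# drops the unobserved category combinations.
group_by_df = restored_df.groupby(
    by=['date', 'label'], observed=True)['value'].sum().reset_index(name='val_sum')
{code}

Either way, the groupby result contains only rows for the date that passed the filter.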

I will close this issue, but perhaps one last question: I could not find any 
open issue in the pandas tracker about transitioning to the pyarrow dataset API. 
Would you happen to know whether this is on their 1.4 roadmap, or would it be 
worth my going there and raising something on the back of this curious find?

I definitely want to leave some trace for others who might stumble on the same 
issue, and it feels like it belongs more in the pandas tracker, if anywhere.

> [Python] unexpected content after groupby on a dataframe restored from partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14772
>                 URL: https://issues.apache.org/jira/browse/ARROW-14772
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1
>            Reporter: Vadim Mironov
>            Priority: Major
>              Labels: scanner
>
> While experimenting with partitioned dataset persistence in parquet, I 
> stumbled upon an interesting feature (or bug?): after restoring only a 
> certain partition and applying a groupby, I suddenly get all the filtered-out 
> rows back in the dataframe. 
> The following code demonstrates the issue:
> {code:python}
> import numpy as np
> import os
> import pandas as pd  # 1.3.4
> import pyarrow as pa  # 6.0.1
> import random
> import shutil
> import string
> import tempfile
> from datetime import datetime, timedelta
>
> if __name__ == '__main__':
>     # 1. generate random data frame
>     day_count = 5
>     data_length = 10
>     numpy_random_gen = np.random.default_rng()
>     label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(5)]
>     partial_dfs = []
>     start_date = datetime.today().date() - timedelta(days=day_count)
>     for date in (start_date + timedelta(n) for n in range(day_count)):
>         date_array = pd.to_datetime(np.full(data_length, date)).date
>         label_array = np.full(data_length, [random.choice(label_choices) for _ in range(data_length)])
>         value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
>         partial_dfs.append(pd.DataFrame(data={'date': date_array, 'label': label_array, 'value': value_array}))
>     df = pd.concat(partial_dfs, ignore_index=True)
>     print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")
>     # 2. persist data frame partitioned by date
>     dataset_dir = tempfile.mkdtemp()
>     df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])
>     # 3. restore from parquet partitioned dataset
>     restored_df = pd.read_parquet(dataset_dir, engine='pyarrow', filters=[('date', '=', str(start_date))], use_legacy_dataset=False)
>     print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")
>     group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
>     print(group_by_df)
>     shutil.rmtree(dataset_dir)
> {code}
> It correctly reports five unique dates when the random df is generated, and 
> correctly reports only one after reading back from parquet:
> {noformat}
> Unique dates after restore:
> 0    2021-11-13
> Name: date, dtype: category
> Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']
> {noformat}
> It does note, however, that there are 5 categories. When I subsequently 
> perform a groupby, all the dates that were filtered out at read time 
> miraculously reappear:
> {code:python}
>     group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
>     print(group_by_df)
> {code}
> With the following output:
> {noformat}
>           date     label  val_sum
> 0   2021-11-13  04LOXJCH      494
> 1   2021-11-13  4QOZ321D      819
> 2   2021-11-13  GG6YO5FS      394
> 3   2021-11-13  J7ZD3LDS      203
> 4   2021-11-13  TFVIXE6L      164
> 5   2021-11-14  04LOXJCH        0
> 6   2021-11-14  4QOZ321D        0
> 7   2021-11-14  GG6YO5FS        0
> 8   2021-11-14  J7ZD3LDS        0
> 9   2021-11-14  TFVIXE6L        0
> 10  2021-11-15  04LOXJCH        0
> 11  2021-11-15  4QOZ321D        0
> 12  2021-11-15  GG6YO5FS        0
> 13  2021-11-15  J7ZD3LDS        0
> 14  2021-11-15  TFVIXE6L        0
> 15  2021-11-16  04LOXJCH        0
> 16  2021-11-16  4QOZ321D        0
> 17  2021-11-16  GG6YO5FS        0
> 18  2021-11-16  J7ZD3LDS        0
> 19  2021-11-16  TFVIXE6L        0
> 20  2021-11-17  04LOXJCH        0
> 21  2021-11-17  4QOZ321D        0
> 22  2021-11-17  GG6YO5FS        0
> 23  2021-11-17  J7ZD3LDS        0
> 24  2021-11-17  TFVIXE6L        0
> {noformat}
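> Inspecting the restored column's dtype (a diagnostic sketch I am adding for 
> others who hit this, not part of the original script) shows where the extra 
> rows come from: the partition column comes back as a pandas Categorical that 
> still carries all five dates, and groupby emits every category combination by 
> default:
> {code:python}
>     # 'date' is restored as a Categorical; the filter removes the rows but
>     # not the category values, so groupby still produces all of them
>     print(restored_df['date'].dtype)           # category
>     print(restored_df['date'].cat.categories)  # lists all five dates
> {code}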
> Perhaps I am doing something incorrectly within the read_parquet call, but my 
> expectation would be for the filtered data to simply be gone after the read 
> operation.


