[ 
https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446622#comment-17446622
 ] 

Weston Pace commented on ARROW-14772:
-------------------------------------

Oh, I can probably explain it.  The partition column is returned as a 
dictionary-encoded string (which pandas converts to a category).  The array 
holds only one value, but the "dictionary" part lists every possible value.  
You can see this in the output above:
{noformat}
Unique dates after restore:
0    2021-11-13
Name: date, dtype: category
Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', 
'2021-11-16', '2021-11-17'] {noformat}
Even though there is only one unique value in the array (2021-11-13) there are 
five different values in the dictionary part (five categories in pandas).
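
A minimal pandas-only sketch of that distinction, assuming made-up labels and values (only the five date strings come from the output above). It does not read parquet; it builds the categorical columns by hand and passes observed=False explicitly, which was the default in pandas 1.3.4:

```python
import pandas as pd

# The five category names mirror the output above; labels and values are invented.
all_dates = ['2021-11-13', '2021-11-14', '2021-11-15',
             '2021-11-16', '2021-11-17']

df = pd.DataFrame({
    # Only one date actually occurs, but all five are declared as categories.
    'date': pd.Categorical(['2021-11-13'] * 3, categories=all_dates),
    'label': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B']),
    'value': [10, 20, 30],
})

unique_values = df['date'].nunique()             # values present in the data
category_count = len(df['date'].cat.categories)  # values in the "dictionary"

# With observed=False, groupby emits one row per combination of categories,
# including combinations that never occur in the data (their sum is 0).
all_groups = (df.groupby(['date', 'label'], observed=False)['value']
              .sum().reset_index(name='val_sum'))

print(unique_values, category_count, len(all_groups))
```

The zero-sum rows in the reported groupby output correspond to these unobserved category combinations.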

So my question isn't "how is this happening" but "what behavior do we want?"  
As a further example, note that we get the exact same result if we read in the 
entire dataset and do the filtering in pandas.
{noformat}
restored_df = pd.read_parquet(dataset_dir, engine='pyarrow', 
use_legacy_dataset=False)
restored_df[restored_df['date'] == str(start_date)]{noformat}
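
The same effect can be reproduced without parquet at all. The sketch below assumes a hand-built categorical column (the values are illustrative) and shows that boolean filtering in pandas keeps the full category list:

```python
import pandas as pd

all_dates = ['2021-11-13', '2021-11-14', '2021-11-15',
             '2021-11-16', '2021-11-17']

# One row per date, with the date column stored as a category.
full_df = pd.DataFrame({
    'date': pd.Categorical(all_dates, categories=all_dates),
    'value': [1, 2, 3, 4, 5],
})

# Filter down to a single date, as in the snippet above.
filtered_df = full_df[full_df['date'] == '2021-11-13']

# One row survives, yet the column still declares all five categories.
print(len(filtered_df), len(filtered_df['date'].cat.categories))
```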
I can see arguments for both sides.  On the one hand there is a bunch of 
unexpected and often useless info.  On the other hand there may be rare cases 
where it would be handy to know what the full range of possible values was.
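
Whichever default is chosen, a caller can already pick either behavior on the pandas side. This is a sketch of two standard pandas options, using invented data rather than the reporter's dataset, and is not a statement about what pyarrow should do:

```python
import pandas as pd

all_dates = ['2021-11-13', '2021-11-14']

df = pd.DataFrame({
    'date': pd.Categorical(['2021-11-13'] * 2, categories=all_dates),
    'label': pd.Categorical(['A', 'B'], categories=['A', 'B']),
    'value': [5, 7],
})

# Option 1: keep the full category list, but group only on observed values.
observed_only = (df.groupby(['date', 'label'], observed=True)['value']
                 .sum().reset_index(name='val_sum'))

# Option 2: drop the categories that no longer occur in the column.
trimmed = df.copy()
trimmed['date'] = trimmed['date'].cat.remove_unused_categories()

print(len(observed_only), list(trimmed['date'].cat.categories))
```

Option 1 keeps the information about the full range of possible values available; option 2 discards it.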

> [Python] unexpected content after groupby on a dataframe restored from 
> partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14772
>                 URL: https://issues.apache.org/jira/browse/ARROW-14772
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1
>            Reporter: Vadim Mironov
>            Priority: Major
>              Labels: scanner
>
> While experimenting with persisting a partitioned dataset in parquet, I 
> stumbled upon an interesting feature (or bug?): after restoring only a 
> certain partition and applying groupby, all the filtered-out rows suddenly 
> reappear in the dataframe. 
> The following code demonstrates the issue:
> {code:python}
> import numpy as np
> import os
> import pandas as pd  # 1.3.4
> import pyarrow as pa  # 6.0.1
> import random
> import shutil
> import string
> import tempfile
> from datetime import datetime, timedelta
> if __name__ == '__main__':
>     # 1. generate random data frame
>     day_count = 5
>     data_length = 10
>     numpy_random_gen = np.random.default_rng()
>     label_choices = [
>         ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
>         for _ in range(5)]
>     partial_dfs = []
>     start_date = datetime.today().date() - timedelta(days=day_count)
>     for date in (start_date + timedelta(n) for n in range(day_count)):
>         date_array = pd.to_datetime(np.full(data_length, date)).date
>         label_array = np.full(
>             data_length,
>             [random.choice(label_choices) for _ in range(data_length)])
>         value_array = numpy_random_gen.integers(
>             low=1, high=500, size=data_length)
>         partial_dfs.append(pd.DataFrame(data={
>             'date': date_array, 'label': label_array, 'value': value_array}))
>     df = pd.concat(partial_dfs, ignore_index=True)
>     unique_before = df.drop_duplicates(subset='date')['date']
>     print(f"Unique dates before restore:\n{unique_before}")
>     # 2. persist data frame partitioned by date
>     dataset_dir = tempfile.mkdtemp()
>     df.to_parquet(path=dataset_dir, engine='pyarrow',
>                   partition_cols=['date', 'label'])
>     # 3. restore from parquet partitioned dataset
>     restored_df = pd.read_parquet(dataset_dir, engine='pyarrow', filters=[
>         ('date', '=', str(start_date))], use_legacy_dataset=False)
>     unique_after = restored_df.drop_duplicates(subset='date')['date']
>     print(f"Unique dates after restore:\n{unique_after}")
>     group_by_df = (restored_df.groupby(by=['date', 'label'])['value']
>                    .sum().reset_index(name='val_sum'))
>     print(group_by_df)
>     shutil.rmtree(dataset_dir) {code}
> It correctly reports five unique dates upon random df generation and 
> correctly reports only one after reading back from parquet:
> {noformat}
> Unique dates after restore:
> 0    2021-11-13
> Name: date, dtype: category
> Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', 
> '2021-11-16', '2021-11-17']{noformat}
> It does note, however, that there are 5 categories. When I subsequently 
> perform a groupby, all dates that were filtered out at read miraculously 
> appear:
> {code:python}
>     group_by_df = (restored_df.groupby(by=['date', 'label'])['value']
>                    .sum().reset_index(name='val_sum'))
>     print(group_by_df)
> {code}
> With the following output:
> {noformat}
>           date     label  val_sum
> 0   2021-11-13  04LOXJCH      494
> 1   2021-11-13  4QOZ321D      819
> 2   2021-11-13  GG6YO5FS      394
> 3   2021-11-13  J7ZD3LDS      203
> 4   2021-11-13  TFVIXE6L      164
> 5   2021-11-14  04LOXJCH        0
> 6   2021-11-14  4QOZ321D        0
> 7   2021-11-14  GG6YO5FS        0
> 8   2021-11-14  J7ZD3LDS        0
> 9   2021-11-14  TFVIXE6L        0
> 10  2021-11-15  04LOXJCH        0
> 11  2021-11-15  4QOZ321D        0
> 12  2021-11-15  GG6YO5FS        0
> 13  2021-11-15  J7ZD3LDS        0
> 14  2021-11-15  TFVIXE6L        0
> 15  2021-11-16  04LOXJCH        0
> 16  2021-11-16  4QOZ321D        0
> 17  2021-11-16  GG6YO5FS        0
> 18  2021-11-16  J7ZD3LDS        0
> 19  2021-11-16  TFVIXE6L        0
> 20  2021-11-17  04LOXJCH        0
> 21  2021-11-17  4QOZ321D        0
> 22  2021-11-17  GG6YO5FS        0
> 23  2021-11-17  J7ZD3LDS        0
> 24  2021-11-17  TFVIXE6L        0{noformat}
> Perhaps I am doing something incorrectly within the read_parquet call, but 
> my expectation would be for the filtered-out data to just be gone after 
> the read operation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
