[
https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447399#comment-17447399
]
Joris Van den Bossche commented on ARROW-14772:
-----------------------------------------------
Sorry for the late response here. This was indeed a classic case of the "all
_known_ categories versus all _present_ categories" confusion in pandas, for
which the {{observed}} keyword was added, as mentioned above.
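To make the distinction concrete, here is a minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration, not taken from the report). The categorical dtype knows three dates, but only one of them occurs in the rows:

```python
import pandas as pd

# Hypothetical data: three known categories, only one present in the rows.
dates = pd.Categorical(
    ["2021-11-13", "2021-11-13"],
    categories=["2021-11-13", "2021-11-14", "2021-11-15"],
)
df = pd.DataFrame({"date": dates, "value": [10, 20]})

# observed=False: one output row per *known* category; empty groups sum to 0.
all_known = df.groupby("date", observed=False)["value"].sum()

# observed=True: only categories that are actually *present* in the data.
only_present = df.groupby("date", observed=True)["value"].sum()

print(all_known)
print(only_present)
```

Passing {{observed}} explicitly avoids depending on the default, which may differ between pandas versions.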
> I could not find any open issues in the pandas issue tracker for a transition
> to the pyarrow dataset API. Would you happen to know if this is something on
> their 1.4 roadmap, or would it be worth my going there and raising something
> on the back of this curious find?
Pandas is actually already using the datasets code under the hood, but
through the {{pq.read_table}} interface rather than the {{ds.dataset}}
interface (by default, {{pq.read_table}} itself uses datasets under the
hood). So I am not sure pandas needs to transition explicitly to the
datasets interface ({{pq.read_table}} has some useful boilerplate around
{{ds.dataset}} that would otherwise need to be replicated in pandas).
But it would certainly be useful to consider the few behavioural differences,
such as the dictionary encoding for the partition columns.
For example, something pandas could add (or pyarrow in {{pq.read_table}}) is a
keyword to disable reading partition columns as dictionary/categorical data
type.
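Until such a keyword exists, one possible workaround on the pandas side is to cast the categorical partition columns back to the dtype of their categories after reading. The helper below, {{decategorize}}, is a hypothetical name for illustration and not part of pandas or pyarrow; the frame stands in for what a filtered read might return.

```python
import pandas as pd


def decategorize(df, columns):
    """Return a copy of df with the given categorical columns cast back
    to the dtype of their categories (hypothetical helper)."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(out[col].cat.categories.dtype)
    return out


# Stand-in for a filtered read: five known dates, one present.
restored = pd.DataFrame({
    "date": pd.Categorical(
        ["2021-11-13", "2021-11-13"],
        categories=["2021-11-13", "2021-11-14", "2021-11-15",
                    "2021-11-16", "2021-11-17"],
    ),
    "value": [1, 2],
})

plain = decategorize(restored, ["date"])
result = plain.groupby("date")["value"].sum()
print(result)
```

An alternative that keeps the categorical dtype is {{Series.cat.remove_unused_categories()}}, which drops the categories that have no rows.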
> [Python] unexpected content after groupby on a dataframe restored from
> partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14772
> URL: https://issues.apache.org/jira/browse/ARROW-14772
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 6.0.1
> Reporter: Vadim Mironov
> Priority: Major
> Labels: scanner
>
> While experimenting with the partitioned dataset persistence in parquet, I
> stumbled upon an interesting feature (or bug?) where, after restoring only a
> certain partition and applying groupby, I suddenly get all the filtered-out
> rows in the dataframe.
> Following code demonstrates the issue:
> {code:java}
> import numpy as np
> import os
> import pandas as pd # 1.3.4
> import pyarrow as pa # 6.0.1
> import random
> import shutil
> import string
> import tempfile
> from datetime import datetime, timedelta
> if __name__ == '__main__':
>     # 1. generate random data frame
>     day_count = 5
>     data_length = 10
>     numpy_random_gen = np.random.default_rng()
>     label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
>                      for _ in range(5)]
>     partial_dfs = []
>     start_date = datetime.today().date() - timedelta(days=day_count)
>     for date in (start_date + timedelta(n) for n in range(day_count)):
>         date_array = pd.to_datetime(np.full(data_length, date)).date
>         label_array = np.full(data_length,
>                               [random.choice(label_choices) for _ in range(data_length)])
>         value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
>         partial_dfs.append(pd.DataFrame(
>             data={'date': date_array, 'label': label_array, 'value': value_array}))
>     df = pd.concat(partial_dfs, ignore_index=True)
>     print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")
>     # 2. persist data frame partitioned by date
>     dataset_dir = tempfile.mkdtemp()
>     df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])
>     # 3. restore from parquet partitioned dataset
>     restored_df = pd.read_parquet(dataset_dir, engine='pyarrow',
>                                   filters=[('date', '=', str(start_date))],
>                                   use_legacy_dataset=False)
>     print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")
>     group_by_df = restored_df.groupby(
>         by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
>     print(group_by_df)
>     shutil.rmtree(dataset_dir)
> {code}
> It correctly reports five unique dates upon random df generation and
> correctly reports only one after reading back from parquet:
> {noformat}
> Unique dates after restore:
> 0 2021-11-13
> Name: date, dtype: category
> Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15',
> '2021-11-16', '2021-11-17']{noformat}
> It does add, however, that there are 5 categories. When I subsequently
> perform a groupby, all dates that were filtered out at read time miraculously
> appear:
> {code:java}
> group_by_df = restored_df.groupby(by=['date',
> 'label'])['value'].sum().reset_index(name='val_sum')
> print(group_by_df)
> {code}
> With the following output:
> {noformat}
>           date     label  val_sum
> 0   2021-11-13  04LOXJCH      494
> 1   2021-11-13  4QOZ321D      819
> 2   2021-11-13  GG6YO5FS      394
> 3   2021-11-13  J7ZD3LDS      203
> 4   2021-11-13  TFVIXE6L      164
> 5   2021-11-14  04LOXJCH        0
> 6   2021-11-14  4QOZ321D        0
> 7   2021-11-14  GG6YO5FS        0
> 8   2021-11-14  J7ZD3LDS        0
> 9   2021-11-14  TFVIXE6L        0
> 10  2021-11-15  04LOXJCH        0
> 11  2021-11-15  4QOZ321D        0
> 12  2021-11-15  GG6YO5FS        0
> 13  2021-11-15  J7ZD3LDS        0
> 14  2021-11-15  TFVIXE6L        0
> 15  2021-11-16  04LOXJCH        0
> 16  2021-11-16  4QOZ321D        0
> 17  2021-11-16  GG6YO5FS        0
> 18  2021-11-16  J7ZD3LDS        0
> 19  2021-11-16  TFVIXE6L        0
> 20  2021-11-17  04LOXJCH        0
> 21  2021-11-17  4QOZ321D        0
> 22  2021-11-17  GG6YO5FS        0
> 23  2021-11-17  J7ZD3LDS        0
> 24  2021-11-17  TFVIXE6L        0{noformat}
> Perhaps I am doing something incorrectly in the read_parquet call, but my
> expectation would be for the filtered-out data to simply be gone after the
> read operation.