tmontes commented on issue #44352:
URL: https://github.com/apache/arrow/issues/44352#issuecomment-2402144910
WORKING VARIATION:
```
import pathlib
import tempfile

import pandas as pd
import pyarrow.dataset as ds

YEAR_COLUMN = '_year'
FILE_COLUMN = '_file'

with tempfile.TemporaryDirectory() as td:
    dataset_path = pathlib.Path(td) / 'dataset'

    # Create a Parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN.
    pd.DataFrame([
        {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
        {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
        {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
    ]).to_parquet(
        dataset_path,
        partition_cols=[YEAR_COLUMN, FILE_COLUMN],
        index=False,
    )

    # Get the dataset row count for a given FILE_COLUMN value: 'a' in this case.
    dataset = ds.dataset(
        dataset_path,
        partitioning=ds.partitioning(flavor='hive'),
        # The default ignore_prefixes is ('.', '_'), which would skip the
        # '_year=...' / '_file=...' partition directories; overriding it to
        # ['.'] keeps the underscore-prefixed directories visible.
        ignore_prefixes=['.'],
    )
    row_count_for_file_a = sum(
        batch.num_rows
        for batch in dataset.to_batches(
            columns=[YEAR_COLUMN],
            filter=(ds.field(FILE_COLUMN) == 'a'),
        )
    )
    assert row_count_for_file_a == 2
```
CLOSING
PS: Thanks and sorry for the noise! :-)