rjzamora commented on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-648269136
Thanks for working on this @jorisvandenbossche!

This does seem like the functionality needed by Dask. To test my
understanding (and for the sake of discussion), I am imagining something
(roughly) like the following in Dask to collect row-group statistics (note that
I am using pyarrow 0.17.1 from conda, so the `get_row_group_fragments` call
would be replaced):
```python
from collections import defaultdict
import json
import pandas as pd
import pyarrow.dataset as pds
from string import ascii_lowercase as letters
path = "simple.pdf"
df0 = pd.DataFrame(
{"x": range(26), "myindex": list(letters)}
).set_index("myindex")
df0.to_parquet(path, engine="pyarrow", row_group_size=10)
ds = pds.dataset(path)
# Need index_cols to be specified by the user or encoded in
# the "pandas" metadata. Otherwise, we will not bother
# to infer an index column (and won't need statistics).
index_cols = json.loads(
    ds.schema.metadata[b"pandas"].decode("utf8")
)["index_columns"]
ds_filter = None  # Some user-defined filter expression
# Collect path and statistics for each row-group
metadata = defaultdict(list)
for file_frag in ds.get_fragments(filter=ds_filter):
    for rg_frag in file_frag.get_row_group_fragments():
        for rg in rg_frag.row_groups:
            # get_min_max_statistics is the (roughly imagined)
            # API being discussed in this PR
            stats = ds.get_min_max_statistics(rg.statistics)
            # Assuming rg.id gives the row-group's index in the file
            metadata[rg_frag.path].append((rg.id, stats))
```
In this case, the resulting `metadata` object would be something like:
```
defaultdict(list,
            {'simple.parquet': [(0, <stats-for-rg-0>),
                                (1, <stats-for-rg-1>),
                                (2, <stats-for-rg-2>)]})
```
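For the sake of discussion, this is (roughly) how Dask would then consume
`metadata` to build sorted partition divisions. Note that `to_divisions` is a
hypothetical helper, and the `stats` entries are assumed to be mappings like
`{"myindex": {"min": ..., "max": ...}}` (the exact shape of the statistics
object is part of what is being worked out in this PR):
```python
def to_divisions(metadata, index_col):
    # Gather per-row-group min/max values for the index column,
    # preserving file order (assumed to match the global sort order)
    mins, maxes = [], []
    for path in sorted(metadata):
        for _rg_index, stats in metadata[path]:
            mins.append(stats[index_col]["min"])
            maxes.append(stats[index_col]["max"])
    # Divisions are only usable if the row-groups are globally
    # sorted and non-overlapping on the index column
    if mins and all(mx <= mn for mx, mn in zip(maxes, mins[1:])):
        return tuple(mins) + (maxes[-1],)
    return None  # Fall back to "unknown" divisions

divisions = to_divisions(metadata, "myindex")
# e.g. ('a', 'k', 'u', 'z') for the three 10/10/6-row
# row-groups written above
```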