rjzamora commented on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-648269136
Thanks for working on this @jorisvandenbossche!

This does seem like the functionality needed by Dask. To test my
understanding (and for the sake of discussion), I am imagining something
(roughly) like the following in Dask to collect row-group statistics (note that
I am using pyarrow 0.17.1 from conda, so the `get_row_group_fragments` call
would be replaced):
```python
from collections import defaultdict
import json
import pandas as pd
import pyarrow.dataset as pds
from string import ascii_lowercase as letters
path = "simple.pdf"
df0 = pd.DataFrame(
{"x": range(26), "myindex": list(letters)}
).set_index("myindex")
df0.to_parquet(path, engine="pyarrow", row_group_size=10)
ds = pds.dataset(path)
# Need index_cols to be specified by the user or encoded in
# the "pandas" metadata. Otherwise, we will not bother
# to infer an index column (and won't need statistics).
index_cols = json.loads(
    ds.schema.metadata[b"pandas"].decode("utf8")
)["index_columns"]
ds_filter = None  # Some user-defined filter expression
# Collect path and statistics for each row-group
metadata = defaultdict(list)
for file_frag in ds.get_fragments(filter=ds_filter):
    for rg_frag in file_frag.get_row_group_fragments():
        for rg in rg_frag.row_groups:
            # get_min_max_statistics is the (roughly imagined)
            # API being discussed in this PR
            stats = ds.get_min_max_statistics(rg.statistics)
            # Assuming rg.id gives the row-group's index in the file
            metadata[rg_frag.path].append((rg.id, stats))
```
In this case, the resulting `metadata` object would be something like:
```
defaultdict(list,
            {'simple.parquet': [(0, <stats-for-rg-0>),
                                (1, <stats-for-rg-1>),
                                (2, <stats-for-rg-2>)]})
```
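For the sake of discussion, this is (roughly) how Dask would then consume
`metadata` to build sorted partition divisions. Note that `to_divisions` is a
hypothetical helper, and the `stats` entries are assumed to be mappings like
`{"myindex": {"min": ..., "max": ...}}` (the exact shape of the statistics
object is part of what is being worked out in this PR):
```python
def to_divisions(metadata, index_col):
    # Gather per-row-group min/max values for the index column,
    # preserving file order (assumed to match the global sort order)
    mins, maxes = [], []
    for path in sorted(metadata):
        for _rg_index, stats in metadata[path]:
            mins.append(stats[index_col]["min"])
            maxes.append(stats[index_col]["max"])
    # Divisions are only usable if the row-groups are globally
    # sorted and non-overlapping on the index column
    if mins and all(mx <= mn for mx, mn in zip(maxes, mins[1:])):
        return tuple(mins) + (maxes[-1],)
    return None  # Fall back to "unknown" divisions

divisions = to_divisions(metadata, "myindex")
# e.g. ('a', 'k', 'u', 'z') for the three 10/10/6-row
# row-groups written above
```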