[GitHub] [arrow] rjzamora commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

GitBox Mon, 29 Jun 2020 08:00:41 -0700


rjzamora commented on pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#issuecomment-651177576



   Thanks for the great work here @bkietz !
   
   This is wonderful - Dask uses the min/max statistics to calculate 
`divisions`, so this functionality is definitely necessary.
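
A minimal sketch of how min/max statistics could turn into `divisions`, assuming each row group is summarized as a plain dict with `"min"` and `"max"` for the index column. The helper name `divisions_from_stats` and the dict layout are hypothetical, and this is a simplification, not Dask's actual implementation:

```python
def divisions_from_stats(row_group_stats):
    """Build a divisions list from per-row-group min/max values.

    Returns None when the row groups are unsorted or overlap, since no
    valid divisions can be derived from the statistics in that case.
    """
    if not row_group_stats:
        return None

    divisions = [row_group_stats[0]["min"]]
    previous_max = row_group_stats[0]["max"]

    for stats in row_group_stats[1:]:
        # Each row group must start at or after the end of the previous one.
        if stats["min"] < previous_max:
            return None
        divisions.append(stats["min"])
        previous_max = stats["max"]

    # The final division is the maximum of the last row group.
    divisions.append(previous_max)
    return divisions


# Hypothetical statistics for three sorted, non-overlapping row groups.
example_stats = [
    {"min": 0, "max": 9},
    {"min": 10, "max": 19},
    {"min": 20, "max": 29},
]
example_divisions = divisions_from_stats(example_stats)
```

With the example data, each row group contributes its minimum and the last one also contributes its maximum, so N row groups give N + 1 division values.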
   
    *A note on other (less-critical, but useful) statistics*:
   Dask also uses the `"total_byte_size"` statistic (for the full row-group, 
not each column) to aggregate partitions before reading in any data.  There is 
also a plan to use the `"num-rows"` statistic when the user executes 
`len(ddf)` (to avoid loading any data).   **How difficult would it be to 
add/expose these additional row-group statistics?**  Again, this is much less 
of a “blocker” for initial integration with Dask, but these are likely things 
we will want to add eventually.  cc @jorisvandenbossche 
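
To illustrate the two use cases above, here is a rough sketch, assuming each row group is described by a plain dict with `"num-rows"` and `"total_byte_size"` keys. The helper names, the `max_bytes` threshold, and the sample values are all hypothetical, and the greedy grouping is only one possible strategy, not necessarily what Dask does:

```python
def total_rows(row_groups):
    """Sum row counts from metadata only, without reading any data."""
    return sum(rg["num-rows"] for rg in row_groups)


def aggregate_by_size(row_groups, max_bytes):
    """Greedily group consecutive row-group indices into partitions.

    A new partition starts whenever adding the next row group would push
    the running byte total above max_bytes. A single row group larger
    than max_bytes still gets a partition of its own.
    """
    partitions = []
    current, current_bytes = [], 0
    for index, rg in enumerate(row_groups):
        size = rg["total_byte_size"]
        if current and current_bytes + size > max_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions


# Hypothetical metadata for four row groups.
example_row_groups = [
    {"num-rows": 100, "total_byte_size": 40},
    {"num-rows": 100, "total_byte_size": 40},
    {"num-rows": 50, "total_byte_size": 30},
    {"num-rows": 25, "total_byte_size": 90},
]
example_length = total_rows(example_row_groups)
example_partitions = aggregate_by_size(example_row_groups, max_bytes=100)
```

Both helpers touch only metadata, which is the point of exposing these statistics: the partition plan and the row count are known before any column data is loaded.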


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
