amol- commented on pull request #11624: URL: https://github.com/apache/arrow/pull/11624#issuecomment-970313614
> pyarrow would add yet another slightly different interface.
> (but I also agree that groupby is not a great name as method on the table for this reason)
> I don't have a strong opinion about the single step or multi step API. I personally rarely ever had the need to do a grouping without an associated aggregation, so I feel that the value of the multistep approach isn't huge, even though it might be easier to evolve in the future.
> Playing a bit with this branch, some other observations:
>
> * I find it unexpected that the resulting table always has "key" column instead of reusing the original name that was specified as the key column
> * Is it possible to group by multiple columns? Not in the current bindings in this PR, but I suppose in C++ / R this is already possible?
> * I think users will very quickly request the ability to specify the resulting column name .. (to not have things like "column_count_distinct")

I implemented support for the first two points in https://github.com/apache/arrow/pull/11624/commits/dfecba12901e6ff13181886b052164f734170d67

Regarding the third one, I wonder if that would be best satisfied by extending the `Table.rename_columns` API to support a mapping of column names, i.e.:

```
t.rename_columns({"oldcolname": "newcolname"})
```

That might be convenient for other use cases too (for example, when you want to rename only a subset of columns) and would expose the ability to do

```
t.group_by("keycol", ["value1"], ["sum"]).rename_columns({"value1_sum": "total"})
```
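Until a mapping is accepted directly, the same effect can probably be had on top of the list-based `Table.rename_columns` by expanding the mapping into a full list of names. A rough sketch of that idea follows; `mapped_names` is a hypothetical helper made up for illustration, not something pyarrow provides, and raising on unknown columns is just one possible choice:

```python
def mapped_names(column_names, mapping):
    """Return all column names in order, with `mapping` applied.

    Hypothetical helper (not part of pyarrow): columns that are not
    mentioned in `mapping` keep their current name.
    """
    # Fail loudly on keys that do not match any existing column,
    # so a typo in the mapping is not silently ignored.
    unknown = [name for name in mapping if name not in column_names]
    if unknown:
        raise KeyError(f"columns not found: {unknown}")
    return [mapping.get(name, name) for name in column_names]


# Only "value1_sum" is renamed; "keycol" is left untouched.
new_names = mapped_names(["keycol", "value1_sum"], {"value1_sum": "total"})
print(new_names)
```

With a real table this would be used as `t.rename_columns(mapped_names(t.column_names, {"value1_sum": "total"}))`, which only relies on `Table.column_names` and the list form of `Table.rename_columns`.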
