magarick opened a new issue, #462:
URL: https://github.com/apache/arrow-datafusion-python/issues/462
As discussed in the Rust Arrow chat, this is a place to chart a way forward
by collecting examples of both what people do in other libraries and what they
want to do but can't do easily with current tools. In addition to clarifying
Python interface requirements, I hope it provides fodder for lower-level
functions, encourages knowledgeable folks to explain how to do things (which
can then get documented), and clarifies what is important to what types of
people and why. There's a daunting amount of work, but the opportunity and
potential are tremendous.
For contributors, I'd like to structure this roughly as follows:
- If it's your first post in this thread, a brief description of your
background and the types of work you've done/do.
- Examples of operations on data you think are important. Broadly of the
form "I can do X in library Y with code sample Z". If it's something that you
like, consider saying why. If you don't like it, explain how it could be easier
or clearer.
- Examples of things you frequently do but have had to implement yourself
and think should be considered for core operations.
- OPTIONALLY: Thoughts you have on what's important to a good data frame API
in a dynamic language.
**OK, now it's my turn to start.**
My background is in statistics, machine learning, and data science. Most of
what I do is focused on modeling and analyzing data, though I've done a good
bit of pipeline and data processing/cleaning too. As such, I place a premium on
in-memory interactive work (notebooks, rmarkdown, etc.). Partly because of
this, I think a lot of tools mistake verbosity for clarity. Striving for
conciseness, when done correctly, often helps readability rather than hurting it.
I have the most experience with R's `data.table` but I've also used dplyr,
polars, pandas, and Julia. So here's a sampling of a few things I would want in
a dataframe library, in no particular order.
1. Here's an example in `data.table` that shows features I think are both
good and bad.
```R
> dt1 = data.table(t = 1:5, v = 5:1)
> dt2 = data.table(start = c(1, 4), end = c(3, 10), x = c("a", "b"))
> dt1[dt2, x := i.x, on = .(t >= start, t < end)]
> dt1
   t v    x
1: 1 5    a
2: 2 4    a
3: 3 3 <NA>
4: 4 2    b
5: 5 1    b
```
First, I find non-equi joins, especially range joins, incredibly useful.
They're common in SQL but a lot of dataframe libraries don't have them.
`data.table` also makes it easy to update in place with the `:=` operator,
which can be used to create new columns as well as update existing ones. As I
understand it, arrow strives for immutability, but at the same time, it won't
make copies of the whole frame if it doesn't have to, so maybe this is less of
an issue. However, I do like the idea of using a join like this to explicitly
tag/annotate another table.
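
For comparison, here is a rough sketch of the same range join written as SQL and run through Python's built-in `sqlite3` module. SQLite is only an illustrative stand-in here, not DataFusion's API, and the column names `start_t` and `end_t` are renamed from `start` and `end` purely for the example. A `LEFT JOIN` keeps every row of `dt1`, so rows that fall in no range get a null `x`, matching the `<NA>` above.

```python
import sqlite3

# In-memory database holding the two tables from the data.table example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dt1 (t INTEGER, v INTEGER);
    CREATE TABLE dt2 (start_t INTEGER, end_t INTEGER, x TEXT);
    INSERT INTO dt1 VALUES (1, 5), (2, 4), (3, 3), (4, 2), (5, 1);
    INSERT INTO dt2 VALUES (1, 3, 'a'), (4, 10, 'b');
""")

# Range (non-equi) join: tag each row of dt1 with the x whose
# half-open interval [start_t, end_t) contains t.
rows = con.execute("""
    SELECT dt1.t, dt1.v, dt2.x
    FROM dt1
    LEFT JOIN dt2
      ON dt1.t >= dt2.start_t AND dt1.t < dt2.end_t
    ORDER BY dt1.t
""").fetchall()

for row in rows:
    print(row)
```

Unlike the `:=` version, this returns a new result rather than updating `dt1` in place.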
2. Reshaping data. These functions transform data between "wide" and "long"
(sometimes known as "tidy") formats. Sometimes they go by `cast/melt` or
`pivot` and there's even a simple `transpose` function in a lot of packages.
Reshaping doesn't seem common in database-world, but to me reshaping in-memory
data is important for a lot of use cases.
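
To make the wide/long distinction concrete, here is a minimal pure-Python sketch of the two directions. The helper names `melt` and `pivot`, their parameters, and the sample data are hypothetical and chosen only for illustration; real libraries offer richer versions of both.

```python
def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Wide -> long: emit one output row per (input row, non-id column)."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for col, val in row.items():
            if col not in id_vars:
                long_rows.append({**ids, var_name: col, value_name: val})
    return long_rows


def pivot(long_rows, id_vars, var_name="variable", value_name="value"):
    """Long -> wide: collect the rows sharing an id back into one row."""
    wide = {}
    for row in long_rows:
        key = tuple(row[k] for k in id_vars)
        out = wide.setdefault(key, {k: row[k] for k in id_vars})
        out[row[var_name]] = row[value_name]
    return list(wide.values())


# Made-up sample data: one row per subject, one column per measurement.
wide_rows = [
    {"id": 1, "height": 170, "weight": 65},
    {"id": 2, "height": 180, "weight": 80},
]

long_rows = melt(wide_rows, id_vars=["id"])
for row in long_rows:
    print(row)

# Pivoting the long form recovers the original wide rows.
print(pivot(long_rows, id_vars=["id"]) == wide_rows)
```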
3. Rolling groupbys. Both Pandas and Polars have pretty good support for
creating overlapping groups and aggregating over them. These are commonly used
for time series analysis. I also think the ability to define groups not just by
a number of rows, but by a potentially variable-width lookback (like, at most 1
month before the current date) is useful. Polars does a pretty good job at
this, and I think Pandas might too.
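
As a sketch of what a variable-width lookback means, the pure-Python function below computes a rolling sum over all rows dated within a fixed time span before the current row, however many rows that turns out to be. The name `rolling_sum`, the sample data, and the approximation of "1 month" as 30 days are all assumptions for illustration; this is not how any particular library implements it.

```python
from collections import deque
from datetime import date, timedelta


def rolling_sum(records, lookback):
    """Rolling sum over a time-based window.

    `records` is a list of (day, value) pairs sorted by day. For each row,
    sum the values whose day lies in (current_day - lookback, current_day].
    """
    window = deque()  # rows currently inside the lookback window
    total = 0
    out = []
    for day, value in records:
        window.append((day, value))
        total += value
        # Drop rows that have aged out. The current row always stays
        # (assuming a positive lookback), so the deque is never empty here.
        while window[0][0] <= day - lookback:
            _, old_value = window.popleft()
            total -= old_value
        out.append((day, total))
    return out


# Made-up, irregularly spaced observations.
records = [
    (date(2023, 1, 1), 10),
    (date(2023, 1, 20), 20),
    (date(2023, 2, 5), 30),
    (date(2023, 3, 1), 40),
]

result = rolling_sum(records, timedelta(days=30))
for day, total in result:
    print(day, total)
```

The number of rows in each window varies (one, two, two, two here), which is the part that a fixed row-count window cannot express.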
Alright, that's a few to get started and this is long enough as it is. I'm
looking forward to seeing what everyone thinks is important, their thoughts on
good DataFrame API design, and what is and isn't currently possible in
DataFusion.