[GitHub] [arrow-datafusion-python] magarick opened a new issue, #462: User Stories for Interface / Feature Design and Documentation

via GitHub Mon, 21 Aug 2023 18:31:10 -0700


magarick opened a new issue, #462:
URL: https://github.com/apache/arrow-datafusion-python/issues/462


   As discussed in the Rust Arrow chat, this is a place to chart a way forward 
by collecting examples of both what people do in other libraries and what they 
want to do but can't do easily with current tools. In addition to clarifying 
Python interface requirements, I hope it provides fodder for lower-level 
functions, encourages knowledgeable folks to explain how to do things (which 
can then get documented), and clarifies what is important to what types of 
people and why. There's a daunting amount of work, but the opportunity and 
potential are tremendous.
   
   For contributors, I'd like to structure this roughly as follows:
   - If it's your first post in this thread, a brief description of your 
background and the types of work you've done/do.
   - Examples of operations on data you think are important. Broadly of the 
form "I can do X in library Y with code sample Z". If it's something that you 
like, consider saying why. If you don't like it, explain how it could be easier 
or clearer.
   - Examples of things you frequently do but have had to implement yourself 
and think should be considered for core operations.
   - OPTIONALLY: Thoughts you have on what's important to a good data frame API 
in a dynamic language.
   
   **OK, now it's my turn to start.**
   My background is in statistics, machine learning, and data science. Most of 
what I do is focused on modeling and analyzing data, though I've done a good 
bit of pipeline and data processing/cleaning too. As such, I place a premium on 
in-memory interactive work (notebooks, rmarkdown, etc.). Partly because of 
this, I think a lot of tools mistake verboseness for clarity, and striving for 
conciseness often helps readability rather than hurting it if done correctly.
   
   I have the most experience with R's `data.table`, but I've also used dplyr, 
polars, pandas, and Julia. So here's a sampling of a few things I would want in 
a dataframe library, in no particular order.
   
   1. Here's an example in `data.table` that shows features I think are both 
good and bad.
   ```R
   > dt1 = data.table(t = 1:5, v = 5:1)
   > dt2 = data.table(start = c(1, 4), end = c(3, 10), x = c("a", "b"))
   > dt1[dt2, x := i.x, on = .(t >= start, t < end)]
   > dt1
      t v    x
   1: 1 5    a
   2: 2 4    a
   3: 3 3 <NA>
   4: 4 2    b
   5: 5 1    b
   ```
   
   First, I find non-equi joins, especially range joins, incredibly useful. 
They're common in SQL, but a lot of dataframe libraries don't have them. 
`data.table` also makes it easy to update in place with the `:=` operator, 
which can be used to create new columns as well as update existing ones. As I 
understand it, Arrow strives for immutability, but at the same time it won't 
copy the whole frame if it doesn't have to, so maybe this is less of an issue. 
However, I do like the idea of using a join like this to explicitly 
tag/annotate another table.
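   
   For comparison, here is a rough sketch of the same range join in pandas, which has no direct non-equi join. It assumes a cross join followed by a filter is acceptable, which is only reasonable for small tables; the names `pairs`, `matched`, and `tagged` are illustrative and not part of any library.

```python
import pandas as pd

dt1 = pd.DataFrame({"t": [1, 2, 3, 4, 5], "v": [5, 4, 3, 2, 1]})
dt2 = pd.DataFrame({"start": [1, 4], "end": [3, 10], "x": ["a", "b"]})

# Pair every row of dt1 with every range in dt2 (cost is len(dt1) * len(dt2)).
pairs = dt1.merge(dt2, how="cross")

# Keep only the pairs that satisfy the range condition start <= t < end.
in_range = (pairs["t"] >= pairs["start"]) & (pairs["t"] < pairs["end"])
matched = pairs.loc[in_range, ["t", "x"]]

# Left join the labels back so unmatched rows (t == 3) get a missing value.
tagged = dt1.merge(matched, on="t", how="left")
print(tagged)
```

   Unlike the `data.table` version, this builds a new frame rather than updating `dt1` in place, and it takes three steps to express what `on = .(t >= start, t < end)` says in one.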
   
   2. Reshaping data. These functions transform data between "wide" and "long" 
(sometimes known as "tidy") formats. Sometimes they go by `cast/melt` or 
`pivot` and there's even a simple `transpose` function in a lot of packages. It 
doesn't seem common in database-world, but to me reshaping in-memory data is 
important for a lot of use cases.
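   
   As a small illustration of what this reshaping looks like in practice, here is a sketch using pandas `melt` and `pivot`. The column names (`id`, `height`, `weight`, `measure`, `value`) are made up for the example.

```python
import pandas as pd

# "Wide" format: one row per id, one column per measurement.
wide = pd.DataFrame({"id": [1, 2], "height": [170, 180], "weight": [65, 80]})

# Wide -> long: one row per (id, measure) pair.
long = wide.melt(id_vars="id", var_name="measure", value_name="value")

# Long -> wide: spread the measure labels back out into columns.
back = long.pivot(index="id", columns="measure", values="value").reset_index()

print(long)
print(back)
```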
   
   3. Rolling groupbys. Both Pandas and Polars have pretty good support for 
creating overlapping groups and aggregating over them. These are commonly used 
for time series analysis. I also think the ability to define groups not just by 
a number of rows, but by a potentially variable-width lookback (like, at most 1 
month before the current date) is useful. Polars does a pretty good job at 
this, and I think Pandas might too.
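   
   To make the distinction concrete, here is a sketch in pandas comparing a fixed row-count window with a variable-width time lookback. It uses a 3-day lookback rather than the 1-month example above, because pandas time-based rolling windows need a fixed-length offset; the data is invented for illustration.

```python
import pandas as pd

ts = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-10", "2023-01-11"]),
)

# Window defined by a number of rows: always the current row and the one before.
fixed = ts["value"].rolling(2).sum()

# Window defined by time: every row within the 3 days up to and including the
# current timestamp, so the number of rows per window varies.
lookback = ts["value"].rolling("3D").sum()

print(pd.DataFrame({"fixed": fixed, "lookback": lookback}))
```

   On the third row the two disagree: the row-count window reaches back eight days to pick up the previous observation, while the time-based window contains only the current row.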
   
   Alright, that's a few to get started and this is long enough as it is. I'm 
looking forward to seeing what everyone thinks is important, their thoughts on 
good DataFrame API design, and what is and isn't currently possible in 
DataFusion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

