alamb commented on issue #14478: URL: https://github.com/apache/datafusion/issues/14478#issuecomment-2637306176
This is a great list -- thank you @ozankabak and everyone for the ideas.

Idea proposal:

> Note that our aim is to answer the following questions positively as we graduate students from the program:

I would like to suggest an additional goal, which is:

1. Create a public written artifact ([DataFusion Blog Post](https://datafusion.apache.org/blog/)) explaining the project and why it is great.

As an example of what I have in mind, check out the writeups by @XiangpengHao (our intern from last year) on [Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet](https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/) and [How Good is Parquet for Wide Tables (Machine Learning Workloads) Really?](https://www.influxdata.com/blog/how-good-parquet-wide-tables/).

## Projects I would be willing (and happy!) to help mentor:

- ✅ https://github.com/apache/datafusion/issues/5504 this is a great project and long overdue in my mind
- ✅ https://github.com/apache/datafusion/issues/13815 (this would also be great)
- ✅ https://github.com/apache/datafusion/issues/13816

## Probably Not: Aggregation Performance

> @alamb, can you help mentor students who would be interested in improving aggregation performance or correlated subqueries? I think @jayzhan-synnada can also help with mentoring aggregation performance work as he spent on it before as well.

I would probably decline advising aggregation performance as an intern project, as I don't think it would be a good intern experience (unless it was an exceptional intern -- see below).

The code is already quite highly optimized and I don't have any simple ideas to make it faster (though maybe @Rachelint does). Any changes here must be made quite carefully to avoid regressions.

The ideal candidate would be someone who already has very strong Rust and low-level optimization experience (we can teach them the needed database internals).
## Probably Not: Correlated Subqueries

For this project:

- https://github.com/apache/datafusion/issues/5483

I think a successful candidate (and the kind of person I would be happy to help mentor) is a graduate student who has a background in query optimizers (for example, one who can explain clearly what [Unnesting Arbitrary Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf) is about).

This isn't a great project for someone who doesn't already have a deep understanding of queries, join graphs, and subqueries, as it will take most of the summer just to understand what we are trying to do.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org