Re: [I] Project Ideas for GSoC 2025 [datafusion]

via GitHub Wed, 05 Feb 2025 07:50:20 -0800


alamb commented on issue #14478:
URL: https://github.com/apache/datafusion/issues/14478#issuecomment-2637306176


   This is a great list -- thank you @ozankabak and everyone for the ideas.
   
   Idea proposal:
   > Note that our aim is to answer the following questions positively as we 
graduate students from the program:
   
   I would like to suggest an additional goal which is:
   1. Create a public written artifact ([DataFusion Blog 
Post](https://datafusion.apache.org/blog/)) explaining the project and why it 
is great. As an example of what I have in mind, check out the writeups by 
@XiangpengHao (our intern from last year) on [Using StringView / German Style 
Strings to Make Queries Faster: Part 1 - Reading 
Parquet](https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/)
 and [How Good is Parquet for Wide Tables (Machine Learning Workloads) 
Really?](https://www.influxdata.com/blog/how-good-parquet-wide-tables/)
   
   
   ## Projects I would be willing (and happy!) to help mentor:
   - ✅ https://github.com/apache/datafusion/issues/5504 this is a great project 
and long overdue in my mind
   - ✅ https://github.com/apache/datafusion/issues/13815 (this would also be 
great)
   - ✅ https://github.com/apache/datafusion/issues/13816
   
   ## Probably Not: Aggregation Performance
   > @alamb, can you help mentor students who would be interested in improving 
aggregation performance or correlated subqueries? I think @jayzhan-synnada can 
also help with mentoring aggregation performance work as he spent on it before 
as well.
   
   I would probably decline to advise aggregation performance as an intern 
project, as I don't think it would be a good intern experience (unless it was 
an exceptional intern -- see below). The code is already highly optimized 
and I don't have any simple ideas to make it faster (though maybe @Rachelint 
does). Any changes here must be made quite carefully to avoid regressions.
   
   The ideal candidate would be someone who already has very strong Rust and 
low-level optimization experience (we can teach them the needed database 
internals).
   
   ## Probably Not: Correlated Subqueries
   For this project:
   - https://github.com/apache/datafusion/issues/5483
   
   For this project, I think a successful candidate (and the kind of person I 
would be happy to help mentor) is a graduate student who has a background in 
query optimizers (for example, someone who can clearly explain what [Unnesting 
Arbitrary 
Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf)
 is about).
   
   This isn't a great project for someone who doesn't already have a deep 
understanding of queries, join graphs, and subqueries, as it will take most of 
the summer just to understand what we are trying to do.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: github-h...@datafusion.apache.org
