Re: [Important] GSoC 2024 Project Ideas

Paul Rogers Sat, 27 Jan 2024 12:27:45 -0800

Some ideas:

* Time marches on. Drill has a design from ten years back. Which parts of
the modern data environment do current users need? Integration with AWS Glue?
Delta Lake/lakehouse/whatever the cool new thing is? Integration with the
latest & greatest BI tools?
* Seems many folks use Drill as a desktop tool. But, Drill is designed for
a distributed environment. Could we provide an in-process exchange operator
that just shifts ownership of vectors rather than serializing them over the
network back to the same process? What other changes would be helpful?
* Implementing modern JSON support: store complex types as Java objects
using the Object Vector. Implement the standard SQL JSON functions.
* Add an Avatica-based JDBC interface. I can provide some of the
server-side stuff from a project I did many moons ago. The benefit is the
ability to use Drill without pulling in a large amount of the Drill code
base with its Guava dependencies, etc.
* Fix the timestamp issue: use UTC throughout rather than the current mix
of UTC and local time. Ensure tests pass regardless of the timezone on the
local machine.
* Implement a 64-bit timestamp type to help with that Parquet extension
that someone is adding. I might be able to dig up a proposal that was done
a few years back. Basically, use the Int64 vector for storage, add support
for nanos in the type functions.
* Review compilation performance using the old-school Janino + byte code
fixups vs. letting modern Java do the work. Five years ago, Java was
faster. Today, it is probably even better. Scrap all the complex code
associated with the old way of doing the work if Java is, in fact, faster.
* Fancy up our Docker and K8s support. Build that all-in-one desktop Drill
image. Ensure the Drill images are up to date on DockerHub. Finish and/or
update the K8s support: Helm chart? Something newer?
* Test Drill on the latest Java versions. Any code changes or library
issues with compiling with the latest? If so, file a JIRA with all the
library issues so they can be tackled. Fix any Drill issues.
* Create a demo data science environment in Python: the equivalent of
SqlLine, but with Pandas, charts, conversion to NumPy arrays, etc. Maybe
have this be a Docker container that can run alongside the improved Drill
one. Write a blog post on Medium or whatever people use these days. Note
when to use the simpler Arrow-based stack vs. when to move up to a true DB
engine.
* Extend the Daffodil work. Address the questions from one of my emails:
can we find a common metadata format so that Daffodil is just one of
several supported sources of metadata? Allow Daffodil to describe any
supported Drill datasource. Integrate Daffodil's file format data with
"statistics" about which files hold which data. Etc.
* For someone from a marketing background: try to find out where Drill is
used today and what that new user base needs. Extra credit: figure out how
to reach similar people who may not have heard of the project, but who
would also benefit from it.
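
To make the two timestamp items above a bit more concrete, here is a
minimal Python sketch (not Drill code, and not the old proposal) of
"UTC throughout" combined with nanosecond precision in a single signed
64-bit integer, which is the kind of value an Int64 vector could hold.
The function names to_epoch_nanos and from_epoch_nanos are made up for
illustration, and the choice of "nanoseconds since the Unix epoch" as
the encoding is an assumption on my part.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
NANOS_PER_SECOND = 1_000_000_000


def to_epoch_nanos(dt: datetime, extra_nanos: int = 0) -> int:
    """Encode an aware datetime as nanoseconds since the UTC epoch.

    Hypothetical helper: extra_nanos carries the sub-microsecond part,
    since Python's datetime only resolves microseconds.
    """
    if dt.tzinfo is None:
        # Refuse naive values so the local zone can never leak in.
        raise ValueError("naive datetime: the zone must be explicit")
    delta = dt - EPOCH  # aware subtraction is carried out in UTC
    nanos = (
        (delta.days * 86_400 + delta.seconds) * NANOS_PER_SECOND
        + delta.microseconds * 1_000
        + extra_nanos
    )
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("value does not fit in a signed 64-bit integer")
    return nanos


def from_epoch_nanos(nanos: int) -> Tuple[datetime, int]:
    """Decode to (UTC datetime, leftover nanoseconds in 0..999)."""
    micros, leftover = divmod(nanos, 1_000)
    return EPOCH + timedelta(microseconds=micros), leftover


if __name__ == "__main__":
    pst = timezone(timedelta(hours=-8))
    local = datetime(2024, 1, 27, 12, 27, 45, tzinfo=pst)
    stored = to_epoch_nanos(local, extra_nanos=123)
    print(stored, from_epoch_nanos(stored))
```

Because the stored value is always relative to UTC, the same instant
encodes to the same integer no matter which zone the machine is in,
which is what the tests would need. The trade-off with this encoding is
range: signed 64-bit nanoseconds cover only roughly the years 1677 to
2262.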


Many of those are non-trivial projects that would appeal to overachiever
types. Sounds like James can prepare a list of projects for the folks with
more typical skills and time commitments.

Thanks,

- Paul

On Sat, Jan 27, 2024 at 8:11 AM James Turton <dz...@apache.org> wrote:

> Supplement: a recent article and commentary on said DBs.
>
> https://news.ycombinator.com/item?id=39119198
>
> On 2024/01/27 18:08, James Turton wrote:
> > I thought of vector database storage / format plugins for Drill to
> > tick their AI/ML box but it isn't clear to me that doing SQL over
> > those datasets is of any use to anyone. I think that we do have other
> > interesting, if unfashionable, lines of work that we could propose.
> >
> > On 2024/01/25 14:20, Priya Sharma wrote:
> >> Hello PMCs,
> >>
> >> Google Summer of Code is the ideal opportunity for you to attract new
> >> contributors to your projects and GSoC 2024 is here.
> >>
> >> The ASF will be applying as a participating organization for GSoC 2024.
> >> As a part of the application, you are all required to start recording
> >> your ideas now [1], by 3rd Feb at the latest.
> >>
> >> There is a slight change in the rules this year; just reiterating here:
> >> - For the 2024 program, there will be three options for project scope:
> >> medium at ~175 hours, large at ~350 hours and a new size: small at ~90
> >> hours.
> >>    Please add the "*full-time*" label to the JIRA for a 350 hour project,
> >> the "*part-time*" label for a 175 hour project and "*small*" for a 90 hour
> >> project.
> >>
> >> Note: They are looking to bring more open source projects in the AI/ML
> >> field into GSoC 2024, so we encourage more projects from this domain
> >> to participate.
> >>
> >> If you are a new mentor or your project is participating for the first
> >> time, please read [2][3].
> >>
> >> On behalf of the GSoC 2024 admins,
> >> Please feel free to reach out to us in case of queries or concerns.
> >>
> >> [1] https://s.apache.org/gsoc2024ideas
> >> [2] https://community.apache.org/gsoc.html
> >> [3] https://community.apache.org/guide-to-being-a-mentor.html
> >
>
>
