I emailed Marc Fournier on this topic some weeks back, but haven't heard from him.
I am teaching the undergrad DB course at UC Berkeley, something I do with some frequency. We have the usual 180 students we get every semester (yep: 180!), but this year we've instituted 2 changes:
1) We changed the course projects to make the students hack PostgreSQL internals, rather than the "minibase" eduware
2) We are coordinating the class with a class at CMU being taught by Prof. Anastassia ("Natassa") Ailamaki
Our "Homework 2", which is being passed out this week, will ask the students to implement a hash-based grouping that spills to disk. I understand this topic has been batted about the pgsql-hackers list recently. The TAs who've prepared the assignment (Sailesh Krishnamurthy at Berkeley and Spiros Papadimitriou at CMU) have also implemented a reference solution to assignment. Once we've got the students' projects all turned in, we'll be very happy to contribute our code back the PostgreSQL project.
I'm hopeful this will lead to many good things:
1) Each year we can pick another feature to assign in class, and contribute back. We'll need to come up with well-scoped engine features that exercise concepts from the class -- eventually we'll run out of tractable things that PGSQL needs, but not in the next couple years I bet.
2) We'll raise a crop of good students who know Postgres internals. Roughly half the Berkeley EECS undergrads take the DB class, and all of them will be post-hackers! (Again, I don't know the stats at CMU.)
So consider this a heads up on the hash-agg front, and on the future contributions front. I'll follow up with another email on PostgreSQL-centered research in our group at Berkeley as well.
Another favor I'd ask is that people on the list be a bit hesitant about helping our students with their homework! We would like them to do it themselves, more or less :-)
Regards,
Joe Hellerstein
--
Joseph M. Hellerstein
Professor, EECS Computer Science Division
UC Berkeley
http://www.cs.berkeley.edu/~jmh
On Tuesday, February 11, 2003, at 06:54 PM, Sailesh Krishnamurthy wrote:
From: Hannu Krosing <[EMAIL PROTECTED]>
Date: Tue Feb 11, 2003 12:21:26 PM US/Pacific
To: Tom Lane <[EMAIL PROTECTED]>
Cc: Bruno Wolff III <[EMAIL PROTECTED]>, Greg Stark <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Subject: Re: [HACKERS] Hash grouping, aggregates
Tom Lane kirjutas T, 11.02.2003 kell 18:39:
If we run out of sort memory, we can always bail out later, preferrablyBruno Wolff III <[EMAIL PROTECTED]> writes:Tom Lane <[EMAIL PROTECTED]> wrote:Greg Stark <[EMAIL PROTECTED]> writes:The neat thing is that hash aggregates would allow grouping on data types thatHm. Right now I think that would barf on you, because the parser wants
have = operators but no useful < operator.
to find the '<' operator to label the grouping column with, even if the
planner later decides not to use it. It'd take some redesign of the
query data structure (specifically SortClause/GroupClause) to avoid that.
I think another issue is that for some = operators you still might notRight, the = operator must be hashable or you're out of luck. But we
be able to use a hash. I would expect the discussion for hash joins in
http://developer.postgresql.org/docs/postgres/xoper-optimization.html
would to hash aggregates as well.
could imagine tweaking the parser to allow GROUP BY if it finds a
hashable = operator and no sort operator. The only objection I can see
to this is that it means the planner *must* use hash aggregation, which
might be a bad move if there are too many distinct groups.
with a descriptive error message. It is not as elegant as erring out at
parse (or even plan/optimise) time, but the result is /almost/ the same.
Relying on hash aggregation will become essential if we are ever going
to implement the "other" groupings (CUBE, ROLLUP, (), ...), so it would
be nice if hash aggregation could also overflow to disk - I suspect that
this will still be faster that running an independent scan for each
GROUP BY grouping and merging the results.
-----
Hannu
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org