On Thu, Nov 14, 2013 at 6:51 PM, Kyotaro HORIGUCHI < horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello, > > > When I read it again and try to relate, I get your point. Actually true, > > hashes must always be performed as last option (if that is what you too > > meant) and if there are few other operations they must be the last one to > > be performed especially after sorting/grouping. Hashes must somehow make > > use of already sorted data (I think this something even you indicated) > > Yes, some 'hash'es could preserve order selecting such a function > for hash function. But at least PostgreSQL's 'HashAggregation' > uses not-order-preserving function as hash function. So output > cannot preserve input ordering. > > > I will do that if I get a DB2 system or Oracle system running. I will try > > to replicate the same 2 test cases and share the plan. One thing which I > am > > sure is, the below part of the plan > > > > QUERY PLAN | Subquery Scan on __unnamed_subquery_0 > > (cost=12971.39..16964.99 rows=614 width=43) (actual > > time=2606.075..2953.937 rows=558 loops=1) > > > > would be generated as RID scan in DB2 (which I have seen to perform > better > > than normal subquery scans in DB2). > > DB2's document says it is used for 'index ORing' corresponds OR > or IN ops, which seems to be a relative to BitmapOr of > PostgreSQL, perhaps not to HashAggregates/SemiJoin. > > I tried to imagin the plan for the group_by case with repeated > index scan and merging.. > > > select student_name > > from student_score > > where (course,score) in (select course,max(score) > > from student_score group by course); > > Taking the advantage that the cardinarity of course is 8, this > query could be transformed into 8 times of index scan and > bitmaping. > > With hypothetical plan node LOOP, and BitmapScanAdd the plan > could be, > > | BitmapHeapScan (rows = 155, loops = 1) > | -> LOOP > | ON Subquery (select distinct course from student_course) as c0 > | -> BitmapScanAdd (loops = 8) > | BitmapCond: (student_score.score = x) > | -> Limit (rows = 1, loops = 8) AS x > | -> Unique (rows = 1, loops = 8) > | -> IndexScan using idx_score on student_course (rows = 1, > loops = 8) > | Filter (student_course.course = c0) > > I suppose this is one possibility of what DB2 is doing. If DB2 > does the same optimization for ranking > 1 with the dense_rank() > case, this plan might be like this, > > I can not be sure but this one seems logically correct from cost and cardinality perspective(am not sure the operations that the DB2 planner would generate ). Need to test it. > | BitmapHeapScan (rows = 133, loops = 1) > | -> LOOP > | ON Subquery (select distinct course from student_course) as c0 > | -> BitmapScanAdd (loops = 8) > | BitmapCond: (student_score.score = x) > | -> Limit (rows = 1, loops = 8) AS x > | -> Unique (rows = 2, loops = 8) > | -> IndexScan using idx_score on student_course (rows = > 18,loops= 8) > | Filter (student_course.course = c0) > > Both plans surely seem to be done shortly for relatively small > n's and number of courses. > > I guess they would do well even when the cardinality of courses is fairly high (unless we hit a scenario where courses offered are more than/in the same decimal range as students opting for them). On the other hand, using semijoin as PostgreSQL does, creating > HashAggregate storing nth place score for every course requires > some memory to work on for each course. > > | Hash Semi Join > | Hash Cond: (a.course = b.course and a.score = b.score) > | -> SeqScan on student_score as a > | -> Hash > | -> HashAggregatesFunc (rows = 8) > | Key = course, func = rankn(dense_rank(), n, key, val) > | -> SeqScan on student_score (rows = 122880) > > Where, rankn() must keep socres down to nth rank and emits nth > score as final value. I don't get more generic form for this > mechanism right now, and the value to do in this specific manner > seems not so much.. > > > I feel the advantage could be more when dealing with a DW environment which has more complex aggregate and windowing queries. Extending this to other windowing function, it could be a great gain for DW and OLAP queries. Regards Sameer