Bryan Pendleton wrote:
[snip]
With the new restriction approach that you are proposing for DERBY-2998,
am I correct that you are suggesting we might invent a new ResultSet
node implementation, which computes the ROW_NUMBER() value for its rows?
Almost.
In my patch I'm suggesting to introduce a new ResultSet, called
OLAPResultSet, that is capable of evaluating restrictions and passing
rows to the above ResultSet (in getNextRowCore()). This OLAPResultSet is
based on how index result sets are handled, and will look at each row
and evaluate it against the supplied restriction GeneratedMethod code
snippet.
As a comparison BulkTableScanResultSet for instance, will pass a chunk
of rows up not caring about restrictions at all.
My intention was that OLAPResultSet could be expanded in the future with
other OLAP specifics that need to live "somewhere" - for example
handling windowing and/or other ranking functions.
The actual row number value is (as of today) computed using a generated
method call into BaseActivation (similar to autoincrement). This may change.
If I am understanding that correctly, are you envisioning that a WHERE
clause which refers to the ROW_NUMBER() result will be implemented by
generating comparison code into the "restriction" method of the generated
activation code?
Correct.
The restricton starts its life as a whereClause in the SelectNode. Then
goes on to become a Predicate passed up and down the querytree during
optimization.
It ends up as a GeneratedMethod restriction snippet after bytecode
generation. This GeneratedMethod is run/executed by the ResultSet (or
rather it's Activation, if I understand this correctly) for any given
row or set of rows depending on the specific ResultSet implementation.
It seems to me that the (potential) optimization win by performing the
ROW_NUMBER() restriction at the right level of the execution tree is
quite high (in fact, it's the whole point of DERBY-581), so it makes
sense to me that it would be worth the optimizer's effort to consider
alternate locations for it, and thus you seem to be heading down the
right path with your research.
Correct.
A WHERE restriction on a ROW_NUMBER() column should be pushed down to
the subquery containing the function. The ROW_NUMBER() function is known
to be ever increasing, and we can utilize that when limiting the number
of rows we pass up the chain (like we can with an index...).
Worst case scenario is what my patch attached to DERBY-2998 is as of
today - no optimization cause a full table scan in the inner select -
lets say 1 million rows - and the restriction is evaluated in the outer
select - lets say limiting to 5 rows. So in this specific case we read
999 995 rows unneccessary.
The optimized version would have pushed the restriction into the inner
select, and stopped pushing rows out to the outer select after the 5th
row of the 1 million. Which would be a lot faster, obviously. The larger
the table the larger the penalty.
I guess I'm mostly just trying to educate myself here, but hopefully
some of these ideas and questions will help you clarify your studies and
help us all understand how this processing works.
I think we are all learing from this - at least I am :)
Cheers,
Thomas