Bryan Pendleton wrote:

[snip]

With the new restriction approach that you are proposing for DERBY-2998,
am I correct that you are suggesting we might invent a new ResultSet
node implementation, which computes the ROW_NUMBER() value for its rows?

Almost.

In my patch I'm suggesting to introduce a new ResultSet, called OLAPResultSet, that is capable of evaluating restrictions and passing rows to the above ResultSet (in getNextRowCore()). This OLAPResultSet is based on how index result sets are handled, and will look at each row and evaluate it against the supplied restriction GeneratedMethod code snippet. As a comparison BulkTableScanResultSet for instance, will pass a chunk of rows up not caring about restrictions at all.

My intention was that OLAPResultSet could be expanded in the future with other OLAP specifics that need to live "somewhere" - for example handling windowing and/or other ranking functions.

The actual row number value is (as of today) computed using a generated method call into BaseActivation (similar to autoincrement). This may change.

If I am understanding that correctly, are you envisioning that a WHERE
clause which refers to the ROW_NUMBER() result will be implemented by
generating comparison code into the "restriction" method of the generated
activation code?

Correct.

The restricton starts its life as a whereClause in the SelectNode. Then goes on to become a Predicate passed up and down the querytree during optimization. It ends up as a GeneratedMethod restriction snippet after bytecode generation. This GeneratedMethod is run/executed by the ResultSet (or rather it's Activation, if I understand this correctly) for any given row or set of rows depending on the specific ResultSet implementation.

It seems to me that the (potential) optimization win by performing the
ROW_NUMBER() restriction at the right level of the execution tree is
quite high (in fact, it's the whole point of DERBY-581), so it makes
sense to me that it would be worth the optimizer's effort to consider
alternate locations for it, and thus you seem to be heading down the
right path with your research.
Correct.

A WHERE restriction on a ROW_NUMBER() column should be pushed down to the subquery containing the function. The ROW_NUMBER() function is known to be ever increasing, and we can utilize that when limiting the number of rows we pass up the chain (like we can with an index...).

Worst case scenario is what my patch attached to DERBY-2998 is as of today - no optimization cause a full table scan in the inner select - lets say 1 million rows - and the restriction is evaluated in the outer select - lets say limiting to 5 rows. So in this specific case we read 999 995 rows unneccessary. The optimized version would have pushed the restriction into the inner select, and stopped pushing rows out to the outer select after the 5th row of the 1 million. Which would be a lot faster, obviously. The larger the table the larger the penalty.

I guess I'm mostly just trying to educate myself here, but hopefully
some of these ideas and questions will help you clarify your studies and
help us all understand how this processing works.

I think we are all learing from this - at least I am :)

Cheers,
Thomas

Reply via email to