Hans Zeller created TRAFODION-2392:
--------------------------------------
Summary: Avoid a costly sort for highly reducing TMUDFs
Key: TRAFODION-2392
URL: https://issues.apache.org/jira/browse/TRAFODION-2392
Project: Apache Trafodion
Issue Type: Improvement
Components: sql-cmp
Affects Versions: 2.0-incubating
Environment: Any
Reporter: Hans Zeller
Assignee: Hans Zeller
When a TMUDF is invoked with an input table that specifies PARTITION BY, the
Trafodion optimizer ensures that the input rows are sorted on (a permutation
of) the PARTITION BY columns, so that each parallel TMUDF instance sees the
rows of each logical partition contiguously. This way the TMUDF can process
each group separately.
This is usually a good way to process the data, except when we are dealing with
a large input table and a TMUDF that highly reduces the input data. In that
case it may be better to maintain a hash table of groups in the TMUDF and to
avoid the costly sort of the input table.
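To illustrate the trade-off, here is a minimal sketch in Python. It is not Trafodion code; the function names and the sample rows are made up for illustration. It contrasts a reducer that relies on contiguous groups with one that keeps a hash table of groups and therefore needs no sort:

```python
from itertools import groupby

def count_contiguous(rows, key):
    """Emit one (key, count) per run of adjacent rows with equal keys.

    Only correct per group if the input arrives sorted/grouped on the key.
    """
    return [(k, sum(1 for _ in run)) for k, run in groupby(rows, key=key)]

def count_hashed(rows, key):
    """Keep one hash-table entry per group; input order does not matter."""
    counts = {}
    for row in rows:
        k = key(row)
        counts[k] = counts.get(k, 0) + 1
    return list(counts.items())

# Hypothetical input: the first field plays the role of the PARTITION BY column.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
first = lambda r: r[0]

sorted_result = count_contiguous(sorted(rows), first)  # needs the sort
hashed_result = count_hashed(rows, first)              # no sort needed
unsorted_result = count_contiguous(rows, first)        # groups are split up
```

On unsorted input the contiguous variant emits a separate result for every run, which is why the sort is required today. The hashed variant trades that sort for memory proportional to the number of distinct groups.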
My proposal is to add a new function type to UDRInvocationInfo.FunctionType,
called REDUCER_NC (for Non-Contiguous). Setting the function type to
REDUCER_NC would tell the optimizer not to request a sort order on the
partitioning columns.
The table below shows how the function type and PARTITION BY and ORDER BY
clauses would determine the effective sort order produced by the optimizer:
||Function type||PARTITION BY||ORDER BY||Data is sorted by||
|REDUCER (existing)|a,b|c,d|a,b,c,d|
|REDUCER (existing)|a,b|<empty>|a,b|
|REDUCER_NC (proposed)|a,b|c,d|c,d|
|REDUCER_NC (proposed)|a,b|<empty>|<no sort>|
In all other respects, the REDUCER and REDUCER_NC function types would behave
the same.
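The rule in the table can be written as a small function. This is only a sketch of the proposed behavior in Python, not optimizer code; the function name and the string constants are hypothetical, and an empty list stands for "no sort":

```python
REDUCER = "REDUCER"        # existing function type
REDUCER_NC = "REDUCER_NC"  # proposed function type (Non-Contiguous)

def effective_sort_order(function_type, partition_by, order_by):
    """Return the columns the input would be sorted by; [] means no sort.

    Simplification: the optimizer may use any permutation of the
    PARTITION BY columns; this sketch keeps them in the order given.
    """
    if function_type == REDUCER:
        # Partitioning columns first, then the ORDER BY columns.
        return list(partition_by) + list(order_by)
    if function_type == REDUCER_NC:
        # No sort requested on the partitioning columns.
        return list(order_by)
    raise ValueError("unsupported function type: %r" % (function_type,))

# The four rows of the table:
table = [
    effective_sort_order(REDUCER, ["a", "b"], ["c", "d"]),
    effective_sort_order(REDUCER, ["a", "b"], []),
    effective_sort_order(REDUCER_NC, ["a", "b"], ["c", "d"]),
    effective_sort_order(REDUCER_NC, ["a", "b"], []),
]
```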