GitHub user robertamarton opened a pull request: https://github.com/apache/incubator-trafodion/pull/773

[TRAFODION-2301]: Hadoop crash with logs TMUDF

Today the event_log_reader UDF scans all logs, loads every event into memory, and only then discards the rows that are not needed. Deferring the filtering until the end takes too much memory and causes system issues.

The immediate solution is predicate pushdown: predicates specified on a query that uses the event_log_reader UDF limit the scope of the data flow. These predicates are pushed into the UDF so that it returns only the required rows instead of all rows. Initially, only comparison predicates are pushed down to the event_log_reader UDF.

In addition to predicate pushdown, a new option has been added to the event_log_reader UDF: the 's' (statistics) option. It reports how many log files were accessed, how many records were read, and how many records were returned. By specifying timestamp ranges, severity types, sql_codes, and the like, the number of returned rows can be reduced.

Example output:

Prior to the change:

    select count(*) from udf(event_log_reader('s'))
      where severity = 'INFO'
        and log_ts between '2016-10-18 00:00:00' and '2016-10-18 22:22:22';

    (16497) EVENT_LOG_READER results: number log files opened: 113,
    number log files read: 113, number rows read: 2820,
    number rows returned: 2736

After the change:

    select count(*) from udf(event_log_reader('s'))
      where severity = 'INFO'
        and log_ts between '2016-10-18 00:00:00' and '2016-10-18 22:22:22';

    (17046) EVENT_LOG_READER results: number log files opened: 115,
    number log files read: 115, number rows read: 2823,
    number rows returned: 109

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/robertamarton/incubator-trafodion trafodion-1758

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-trafodion/pull/773.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #773

----

commit 913d2337e029a0f904539a1d9d6ea064f90aa6ab
Author: Roberta Marton <rmarton@edev07.esgyn.local>
Date:   2016-10-21T01:37:33Z

    [TRAFODION-2301]: Hadoop crash with logs TMUDF
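For readers unfamiliar with the mechanism, here is a minimal sketch of what pushing a comparison predicate into a TMUDF looks like in Trafodion's C++ UDF interface (the tmudr classes declared in sqludr.h). This is an illustration under stated assumptions, not the actual diff in this pull request; the class name is hypothetical and the exact method and enum names should be checked against sqludr.h:

    // Minimal sketch (not this PR's code): during query compilation the
    // UDF marks comparison predicates for evaluation inside the UDF, so
    // the executor pushes them down and non-matching rows never leave it.
    // Method/enum names follow the tmudr interface as documented for
    // Trafodion TMUDFs; treat exact signatures as assumptions.
    #include "sqludr.h"

    using namespace tmudr;

    class EventLogReaderSketch : public UDR   // hypothetical class name
    {
    public:
      // Compiler interface: called while the query plan is being built.
      void describeDataflowAndPredicates(UDRInvocationInfo &info)
      {
        for (int p = 0; p < info.getNumPredicates(); p++)
          // Initially only comparison predicates (=, <>, <, <=, >, >=)
          // are accepted, matching the PR description; anything else is
          // still evaluated on the UDF's result rows by the engine.
          if (info.isAComparisonPredicate(p))
            info.setPredicateEvaluationCode(p,
                                            PredicateInfo::EVALUATE_IN_UDF);
      }
    };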
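The run-time half then has to honor whatever was accepted at compile time. A sketch under the same assumptions: a hypothetical rowMatches() helper checks each parsed log event against the pushed-down comparison predicates before the row is emitted, comparing values as strings for brevity (real code would respect each output column's SQL type; string comparison does happen to order 'YYYY-MM-DD hh:mm:ss' timestamps correctly):

    // Hypothetical helper, same assumptions as above: filter one parsed
    // event against every pushed-down comparison predicate before the
    // row is emitted (and before it consumes any buffer memory).
    #include <string>
    #include "sqludr.h"

    using namespace tmudr;

    static bool rowMatches(const UDRInvocationInfo &info,
                           const std::string *colValues)
    {
      for (int p = 0; p < info.getNumPredicates(); p++)
      {
        if (!info.isAComparisonPredicate(p))
          continue;   // not pushed down; engine evaluates it on the result

        const ComparisonPredicateInfo &cpi = info.getComparisonPredicate(p);
        const std::string &v = colValues[cpi.getColumnNumber()];
        const std::string  c = cpi.getConstValue();

        switch (cpi.getOperator())
        {
          case PredicateInfo::EQUAL:         if (!(v == c)) return false; break;
          case PredicateInfo::NOT_EQUAL:     if (!(v != c)) return false; break;
          case PredicateInfo::LESS:          if (!(v <  c)) return false; break;
          case PredicateInfo::LESS_EQUAL:    if (!(v <= c)) return false; break;
          case PredicateInfo::GREATER:       if (!(v >  c)) return false; break;
          case PredicateInfo::GREATER_EQUAL: if (!(v >= c)) return false; break;
          default: break;   // an operator we did not accept at compile time
        }
      }
      return true;   // row satisfies all pushed-down predicates
    }

This is why, in the example output above, "number rows returned" drops from 2736 to 109 while "number rows read" stays roughly the same: every event is still parsed, but only matching rows leave the UDF and occupy memory downstream.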