Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/incubator-madlib/pull/49#discussion_r67950103
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table,
output_table, partition_expr,
@param source_table: str, Name of the input table/view
@param output_table: str, Name of the table to store result
@param partition_expr: str, Expression to partition (group) the
input data
- @param time_stamp: str, Column name with time used for
sessionization calculation
+ @param time_stamp: str, The time stamp column name that is used
for sessionization calculation
@param max_time: interval, Delta time between subsequent events to
define a session
-
+ @param output_cols: str, list of columns the output table/view
must contain (default '*'):
+ * - all columns in the input table, and a new
session ID column
+ 'a,b,c,...' - a comma separated list of column
names/expressions to be projected, along with a new session ID column
+ @param create_view: boolean, indicates if the output is a view or
a table with name specified by output_table (default TRUE)
+ TRUE - create view
+ FALSE - materialize results into a table
"""
with MinWarning("error"):
_validate(source_table, output_table, partition_expr, time_stamp,
max_time)
+ table_or_view = 'VIEW' if create_view or create_view is None else
'TABLE'
+ output_cols_to_project = '*' if output_cols is None else
output_cols
- all_input_cols_str = ', '.join([i.strip() for i in
get_cols(source_table, schema_madlib)])
+ cols_to_project = get_column_names(schema_madlib, source_table,
output_cols_to_project)
session_id = 'session_id' if not is_var_valid(source_table,
'session_id') else unique_string('session_id')
# Create temp column names for intermediate columns.
new_partition = unique_string('new_partition')
new_session = unique_string('new_session')
plpy.execute("""
- CREATE TABLE {output_table} AS
+ CREATE {table_or_view} {output_table} AS
SELECT
- {all_input_cols_str},
+ {cols_to_project},
CASE WHEN {time_stamp} IS NOT NULL
- THEN SUM(CASE WHEN {new_partition} OR
{new_session} THEN 1 END)
- OVER (PARTITION BY {partition_expr}
- ORDER BY {time_stamp})
- END AS {session_id}
+ THEN SUM(CASE WHEN {new_partition} OR
{new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY
{time_stamp}) END AS {session_id}
FROM (
SELECT *,
- ROW_NUMBER() OVER (w) = 1
- AND {time_stamp} IS NOT NULL AS
{new_partition},
- ({time_stamp} - LAG({time_stamp}, 1)
- OVER (w)) > '{max_time}'::INTERVAL AS
{new_session}
- FROM {source_table}
- WINDOW w AS (PARTITION BY {partition_expr}
- ORDER BY {time_stamp})
+ ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS
NOT NULL AS {new_partition},
+ ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) >
'{max_time}'::INTERVAL AS {new_session}
+ FROM {source_table} WINDOW w AS (PARTITION BY
{partition_expr} ORDER BY {time_stamp})
) a
""".format(**locals()))
+def get_column_names(schema_madlib, source_table, output_cols):
+ """
+ This method creates a string that can be used in the SQL statement
to project the columns specified in the output_cols parameter.
+
+ Return:
+ a string to be used in the SQL statement
+ """
+ table_columns_list = get_cols(source_table, schema_madlib)
+ if output_cols.strip() == '*':
+ output_cols_str = get_cols_str(table_columns_list)
+ else:
+ output_cols_list, output_cols_names =
get_columns_from_expression(output_cols, table_columns_list)
+ _validate_output_cols(source_table, output_cols_list)
+ output_cols_str = ', '.join([output_cols_names[i] if
output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' +
output_cols_names[i]
+ for i in range(len(output_cols_list))])
+ return output_cols_str
+
+def create_column_name_from_expression(col_name, table_columns_list):
+ if col_name == '*':
+ return get_cols_str(table_columns_list)
+ else:
+ # Column name cannot have more than one pair of quotes in it.
Removing any existing quotes, and then quoting the new string obtained.
--- End diff --
I am not sure I understand the comment exactly. This method does take care
of column names with spaces in the name. Say I had a column name called "user
id" (note that it is quoted, since postgres does not let me name a column with
spaces without quotes), and another column called revenue.
If my output_cols parameter was:
' "user id"<100, revenue>20 '
I parse this to rename each expression:
- "user id"<100 is named as "user id<100"
- revenue>20 is named as "revenue>20"
This ends up creating a select statement as follows:
SELECT "user id"<100 AS "user id<100", revenue>20 AS "revenue>20"
All I am doing in this function is creating a new column name for an
expression. I quote the expression to create a new column name, but before
doing that, I remove any existing quotes from the expression (because postgres
does not let me have quotes inside a quoted column name).
I checked again to make sure this is indeed how its working. I will put up
some install checks to cover these cases too.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---