[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

njayaram2 Tue, 21 Jun 2016 13:56:42 -0700

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67950103
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, 
output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the 
input data
    -        @param time_stamp: str, Column name with time used for 
sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used 
for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to 
define a session
    -
    +        @param output_cols: str, list of columns the output table/view 
must contain (default '*'):
    +                        * - all columns in the input table, and a new 
session ID column
    +                        'a,b,c,...' -  a comma separated list of column 
names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or 
a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, 
max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 
'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else 
output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in 
get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, 
output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 
'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR 
{new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR 
{new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY 
{time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS 
{new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS 
{new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS 
NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > 
'{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY 
{partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement 
to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = 
get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if 
output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + 
output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    +    if col_name == '*':
    +        return get_cols_str(table_columns_list)
    +    else:
    +        # Column name cannot have more than one pair of quotes in it. 
Removing any existing quotes, and then quoting the new string obtained.
    --- End diff --
    
    I am not sure I understand the comment exactly. This method does take care 
of column names with spaces in the name. Say I had a column name called "user 
id" (note that it is quoted, since postgres does not let me name a column with 
spaces without quotes), and another column called revenue. 
    If my output_cols parameter was:
    ' "user id"<100, revenue>20 '
    
    I parse this to rename each expression:
    - "user id"<100 is named as "user id<100"
    - revenue>20 is named as "revenue>20"
    
    This ends up creating a select statement as follows:
    SELECT "user id"<100 AS "user id<100", revenue>20 AS "revenue>20"
    
    All I am doing in this function is creating a new column name for an 
expression. I quote the expression to create a new column name, but before 
doing that, I remove any existing quotes from the expression (because postgres 
does not let me have quotes inside a quoted column name). 
    
    I checked again to make sure this is indeed how its working. I will put up 
some install checks to cover these cases too.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Reply via email to