[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

decibel Tue, 21 Jun 2016 14:22:16 -0700

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67954244
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, 
output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the 
input data
    -        @param time_stamp: str, Column name with time used for 
sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used 
for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to 
define a session
    -
    +        @param output_cols: str, list of columns the output table/view 
must contain (default '*'):
    +                        * - all columns in the input table, and a new 
session ID column
    +                        'a,b,c,...' -  a comma separated list of column 
names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or 
a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, 
max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 
'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else 
output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in 
get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, 
output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 
'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR 
{new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR 
{new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY 
{time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS 
{new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS 
{new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS 
NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > 
'{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY 
{partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement 
to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = 
get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if 
output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + 
output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    +    if col_name == '*':
    +        return get_cols_str(table_columns_list)
    +    else:
    +        # Column name cannot have more than one pair of quotes in it. 
Removing any existing quotes, and then quoting the new string obtained.
    --- End diff --
    
    >   * "user id"<100 is named as "user id<100"
    >   * revenue>20 is named as "revenue>20"
    
    Ahh, ok. That was completely unclear to me. It would be good to add 
    comments/docstrings to the various functions explaining what they're doing.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Reply via email to