[
https://issues.apache.org/jira/browse/MADLIB-1002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311058#comment-15311058
]
Jim Nasby commented on MADLIB-1002:
-----------------------------------
I took a quick look at this based on the phase 1 work. After some refactoring,
the guts of what I came up with are below. Note that I changed some of the
parameter names per my comments on https://github.com/apache/incubator-madlib/pull/44.
{code:SQL}
...
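-- ORDER BY turns the SUM into a running count; without it every row in a
-- partition would get the same partition-wide total.
, SUM(CASE WHEN new_partition OR new_session THEN 1 END)
OVER(PARTITION BY {partition_expr} ORDER BY {time_stamp}) AS session_id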
FROM (
SELECT *
, ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS new_partition
, ({time_stamp}-LAG({time_stamp},1) OVER (w)) < '{minimum_delta}' AS item_too_soon
, ({time_stamp}-LAG({time_stamp},1) OVER (w)) > '{time_out}' AS new_session
FROM {source_table}
WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
) a
WHERE
-- Note that (NULL IS NOT TRUE) returns TRUE.
item_too_soon IS NOT TRUE
{code}
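For comparison with implementation note 2 in the issue description ("always compare against the last valid session event"), here is a minimal Python sketch of that rule for a single partition. Note that LAG in the query above looks at the immediately preceding row, whether or not that row is later filtered out. This is only an illustration, not MADlib code: the function name `sessionize_partition` and its arguments are hypothetical, and it reuses the query's comparisons (`<` for the minimum delta, `>` for the time-out).

```python
from datetime import datetime, timedelta


def sessionize_partition(timestamps, time_out, min_time=timedelta(0)):
    """Hypothetical helper: assign session ids within ONE partition.

    Returns a list of (timestamp, session_id) for the events that are kept.
    Each event is compared against the last *valid* (kept) event, never
    against an event that was just dropped.
    """
    kept = []
    last_valid = None
    session_id = 0
    for ts in sorted(timestamps):
        if last_valid is None:
            # First event of the partition always opens session 1.
            session_id = 1
        else:
            delta = ts - last_valid
            if delta < min_time:
                # Too soon after the last valid event: drop it and do not
                # update last_valid.
                continue
            if delta > time_out:
                # Gap exceeds the time-out: start a new session.
                session_id += 1
        kept.append((ts, session_id))
        last_valid = ts
    return kept


if __name__ == "__main__":
    base = datetime(2016, 6, 1, 12, 0, 0)
    events = [base + timedelta(seconds=s) for s in (0, 5, 12, 100)]
    for ts, sid in sessionize_partition(
        events, time_out=timedelta(seconds=60), min_time=timedelta(seconds=10)
    ):
        print(ts.time(), sid)
```

With events at 0 s, 5 s, 12 s and 100 s, a 10 s minimum and a 60 s time-out, the 5 s event is dropped, the 12 s event is kept because it is measured from the 0 s event rather than from the dropped 5 s one, and the 100 s event starts session 2.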
> Sessionization - Phase 3 (minimum time)
> ---------------------------------------
>
> Key: MADLIB-1002
> URL: https://issues.apache.org/jira/browse/MADLIB-1002
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Minor
> Labels: gsoc2016, starter
> Fix For: v1.9.2
>
>
> Story
> As a data scientist, I want to perform session reconstruction on my data set,
> so that I can prepare for input into other algorithms like path functions, or
> predictive analytics algorithms.
> This is a follow-on to
> https://issues.apache.org/jira/browse/MADLIB-909
> https://issues.apache.org/jira/browse/MADLIB-1001
> that adds a minimum time.
> Details
> Add min time to the existing params:
> Proposed interface changes:
> {code}
> sessionize (
> source_table,
> output_table,
> partition_expr,
> order_expr,
> time_stamp,
> time_out,
> min_time, -- new
> output_cols,
> create_view
> )
> {code}
> where
> min_time (optional)
> Minimum time that must elapse since the last valid event for an event to be
> considered valid (default=0). An event that occurs less than min_time after
> the last valid event is dropped and is not included in the current session.
> Same units as time_stamp.
> Implementation notes
> 1) Should be specified in the same units as the time_out parameter.
> 2) Always compare against the last valid session event, not against one(s)
> that just got dropped.
> For an example of how min_time could work, see Aster Analytics sessionization
> function [1].
> References
> [1] Aster Analytics users guide, see "sessionize" function
> http://www.info.teradata.com/edownload.cfm?itemid=143450001
> http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
> https://www.youtube.com/watch?v=C760M9ttK9Q
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)