[
https://issues.apache.org/jira/browse/MADLIB-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-909:
-----------------------------------
Description:
Story
As a data scientist, I want to perform session reconstruction on my data set,
so that I can prepare for input into other algorithms like path functions, or
predictive analytics algorithms.
Details
1) The PDL Tools sessionization module [1] is one example
implementation. Source code is located at [2]. Also see [7].
2) How to sessionize. PDL Tools uses time-based session reconstruction, which
defines a session as a sequence of events by a particular user where no more
than n seconds have elapsed between successive events. That is, if we don’t
see an event from a user for more than n seconds, the next event starts a new
session. The requirement for MADlib is similar but with the following addition:
* generalize partition expression
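The gap rule in item 2 can be sketched in a few lines. This is only an illustration, not the PDL Tools or MADlib code; the function name assign_sessions and the use of plain numeric seconds for timestamps are assumptions made for the example.

```python
def assign_sessions(timestamps, max_gap):
    """Return a 1-based session id for each event of a single user.

    timestamps: event times in seconds, already sorted ascending.
    max_gap: a new session starts when more than this many seconds
             separate two successive events.
    """
    session_ids = []
    current = 0
    previous = None
    for t in timestamps:
        # The first event, or a gap larger than max_gap, opens a new session.
        if previous is None or t - previous > max_gap:
            current += 1
        session_ids.append(current)
        previous = t
    return session_ids


print(assign_sessions([0, 10, 25, 500, 510, 2000], max_gap=60))
```

A gap of exactly max_gap seconds keeps the event in the current session, matching "no more than n seconds".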
3) Proposed interface:
{code}
sessionize (
source_table,
output_table,
partition_expr,
time_stamp,
max_time)
{code}
where
output_table
Adds 2 new columns to those of source_table: session_id and new_session:
* session_id = 1, 2, ..., n, where n is the number of sessions in the partition
partition_expr
VARCHAR. The 'partition_expr' can be a single column or a list of
comma-separated columns/expressions to divide all rows into groups, or
partitions. Matching is applied across the rows that fall into the same
partition. This can be NULL or '' to indicate the matching is to be applied to
the whole table.
time_stamp
Name of the column holding the time used for the sessionization calculation
max_time
Maximum delta time between successive events within a session, i.e., the
session timeout.
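To make the parameters concrete, here is a rough sketch of how they might fit together, using a list of dicts in place of a table. It is not the MADlib implementation. Assumptions: partition_expr is reduced to a list of column names, max_time is a timedelta, session_id restarts at 1 in each partition, and new_session is taken to be a flag marking the first event of each session (the description above does not define it).

```python
from datetime import datetime, timedelta
from itertools import groupby


def sessionize(rows, partition_cols, time_col, max_time):
    """Return copies of rows with session_id and new_session added.

    rows: list of dicts standing in for source_table.
    partition_cols: column names to partition by; empty means whole table.
    time_col: name of the timestamp column.
    max_time: timedelta; a larger gap between events starts a new session.
    """
    def part_key(row):
        return tuple(row[c] for c in partition_cols)

    # Order by partition, then by time within each partition.
    ordered = sorted(rows, key=lambda r: (part_key(r), r[time_col]))
    out = []
    for _, group in groupby(ordered, key=part_key):
        session_id, previous = 0, None
        for row in group:
            # Assumption: new_session flags the first event of a session.
            new_session = previous is None or row[time_col] - previous > max_time
            if new_session:
                session_id += 1
            out.append({**row, "session_id": session_id, "new_session": new_session})
            previous = row[time_col]
    return out


events = [
    {"user_id": "a", "ts": datetime(2016, 1, 1, 10, 0)},
    {"user_id": "b", "ts": datetime(2016, 1, 1, 10, 1)},
    {"user_id": "a", "ts": datetime(2016, 1, 1, 10, 5)},
    {"user_id": "a", "ts": datetime(2016, 1, 1, 11, 0)},
]
for row in sessionize(events, ["user_id"], "ts", timedelta(minutes=30)):
    print(row["user_id"], row["ts"], row["session_id"], row["new_session"])
```

Passing an empty partition_cols list treats the whole table as one partition, mirroring the NULL or '' case of partition_expr.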
Questions
1) Q: Do we need separate 'order_expr' and 'time_stamp' columns? Aster does
it this way.
A: No, we can't come up with a reason why a user would need this. If we want
to add it later, we can add it as an optional parameter.
2) Q: What should happen if delta_t between events is negative?
A: Do not include the event in a session, and output a warning message.
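One possible reading of answer 2, sketched under stated assumptions: events are processed in the order given rather than re-sorted, an event whose timestamp precedes the last accepted event is left out of every session (marked here with a session id of None), and the warning goes through Python's warnings module. The function name and the None marker are illustrative only; the issue does not specify them.

```python
import warnings


def assign_sessions_skip_negative(timestamps, max_time):
    """Session ids in input order; None for events with a negative delta_t."""
    session_ids = []
    current, previous = 0, None
    for i, t in enumerate(timestamps):
        if previous is not None and t - previous < 0:
            # Negative delta_t: exclude the event and warn.
            warnings.warn(f"event {i} has negative delta_t; excluded from sessions")
            session_ids.append(None)
            continue  # previous stays at the last accepted event
        if previous is None or t - previous > max_time:
            current += 1
        session_ids.append(current)
        previous = t
    return session_ids
```

Measuring the next delta from the last accepted event, not from the excluded one, is a design choice of this sketch.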
References
[1] PDL Tools sessionization module
http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html
[2] PDL tools source code
https://github.com/pivotalsoftware/PDLTools
[3] Blog on bot signatures from Akamai
https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html
[4] Aster Analytics users guide, see "sessionize" function
http://www.info.teradata.com/edownload.cfm?itemid=143450001
http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
https://www.youtube.com/watch?v=C760M9ttK9Q
[5] General information on sessionization
https://en.wikipedia.org/wiki/Session_(web_analytics)
[6] See path function for partition and order by params
http://madlib.incubator.apache.org/docs/latest/group__grp__path.html
[7] SQL sessionization example from blog
https://blog.pivotal.io/pivotal/products/time-series-analysis-1-introduction-to-window-functions
[8] Postgres example of SQL based sessionization
http://randyzwitch.com/sessionizing-log-data-sql/
was:
Story
As a data scientist, I want to perform session reconstruction on my data set,
so that I can prepare for input into other algorithms like path functions, or
predictive analytics algorithms.
Details
1) The PDL Tools sessionization module [1] is one example
implementation. Source code is located at [2].
2) How to sessionize. PDL Tools uses a time based session reconstruction that
defines a session as a sequence of events by a particular user where no more
than n seconds has elapsed between successive events. That is, if we don’t
see an event from a user for n seconds, start a new session. The requirement
for MADlib is similar but with the following addition:
* generalize partition and order expressions
3) Proposed interface:
{code}
sessionize (
source_table,
output_table,
partition_expr,
order_expr,
time_stamp,
time_out)
{code}
where
partition_expr
VARCHAR. The 'partition_expr' can be a single column or a list of
comma-separated columns/expressions to divide all rows into groups, or
partitions. Matching is applied across the rows that fall into the same
partition. This can be NULL or '' to indicate the matching is to be applied to
the whole table.
order_expr
VARCHAR. This expression controls the order in which rows are processed or
matched in a partition. For example, time is a common way to order partitions.
time_stamp
Column name with time used for sessionize calculation (often will be the same
as order_expr but may not always be)
time_out
Number of seconds between subsequent events to define a session. Same units
as time_stamp.
Acceptance
1) New test cases in install-check and TINC. TINC tests should include output
validation tests (manually verified) and negative tests.
2) Updated documentation and online help functions (online help refers to the
documentation that is accessible directly via SQL).
3) All tests should pass on Pulse.
4) Code should be independently reviewed and tested.
References
[1] PDL Tools sessionization module
http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html
[2] PDL tools source code
https://github.com/pivotalsoftware/PDLTools
[3] Blog on bot signatures from Akamai
https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html
[4] Aster Analytics users guide, see "sessionize" function
http://www.info.teradata.com/edownload.cfm?itemid=143450001
http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
https://www.youtube.com/watch?v=C760M9ttK9Q
[5] General information on sessionization
https://en.wikipedia.org/wiki/Session_(web_analytics)
[6] See path function for partition and order by params
http://madlib.incubator.apache.org/docs/latest/group__grp__path.html
> Sessionization - Phase 1
> ------------------------
>
> Key: MADLIB-909
> URL: https://issues.apache.org/jira/browse/MADLIB-909
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Assignee: Nandish Jayaram
> Labels: gsoc2016, starter
> Fix For: v1.9.1
>
>