[ https://issues.apache.org/jira/browse/MADLIB-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-909:
-----------------------------------
Description:
Story
As a data scientist, I want to perform session reconstruction on my data set,
so that I can prepare for input into other algorithms like path functions, or
predictive analytics algorithms.
Details
1) The PDL Tools sessionization module [1] is one example
implementation. Source code is located at [2]. Also see [7].
2) How to sessionize. PDL Tools uses a time-based session reconstruction that
defines a session as a sequence of events by a particular user where no more
than n seconds have elapsed between successive events. That is, if more than
n seconds pass without an event from a user, the next event starts a new
session. The requirement for MADlib is similar but with the following addition:
* generalize partition expression
3) Proposed interface:
{code}
sessionize (
source_table,
output_table,
partition_expr,
time_stamp,
max_time)
{code}
where
output_table
Adds 2 new columns to the source_table, session_id and new_session:
* session_id = 1, 2, ..., n where n is the number of sessions in the partition
partition_expr
VARCHAR. The 'partition_expr' can be a single column or a list of
comma-separated columns/expressions to divide all rows into groups, or
partitions. Sessionization is applied across the rows that fall into the same
partition. This can be NULL or '' to indicate that it is to be applied to
the whole table.
time_stamp
Name of the column holding the time used for the sessionization calculation.
max_time
Maximum time delta between subsequent events within a session, i.e., the
session timeout.
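The time-based rule in item 2 and the parameters above can be sketched in a few lines of Python. This is only an illustration of the intended semantics, not the MADlib implementation: the dict-based rows, the callable partition key, numeric timestamps in seconds, and the reading of new_session as a flag marking the first event of each session are all assumptions made for the example.

```python
def sessionize(rows, partition_key, time_stamp, max_time):
    """Return copies of rows with session_id and new_session added.

    rows          -- iterable of dicts (hypothetical stand-in for source_table)
    partition_key -- function mapping a row to its partition, or None
                     to treat the whole table as a single partition
    time_stamp    -- name of the key holding a numeric time in seconds
    max_time      -- session timeout in seconds
    """
    key = partition_key or (lambda row: None)

    # Group rows by partition, keeping first-seen partition order.
    partitions = {}
    for row in rows:
        partitions.setdefault(key(row), []).append(row)

    out = []
    for part in partitions.values():
        part.sort(key=lambda r: r[time_stamp])
        session_id = 0
        previous = None
        for row in part:
            # A new session starts on the first event of a partition or
            # when more than max_time has elapsed since the previous event.
            new_session = previous is None or row[time_stamp] - previous > max_time
            if new_session:
                session_id += 1
            out.append({**row, "session_id": session_id, "new_session": new_session})
            previous = row[time_stamp]
    return out


events = [
    {"user": "a", "t": 0}, {"user": "a", "t": 50}, {"user": "a", "t": 200},
    {"user": "b", "t": 10}, {"user": "b", "t": 20},
]
per_user = sessionize(events, lambda r: r["user"], "t", 100)
whole_table = sessionize(events, None, "t", 100)
```

With a 100 second timeout, user "a" gets two sessions because of the 150 second gap before the last event, while passing None as the partition key numbers sessions over the whole table.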
Questions
1) Q: Do we need separate 'order_expr' and 'time_stamp' columns? Aster does
it this way.
A: No, we can't come up with a reason why a user would need this. If we want
to add it later, it can be an optional parameter.
2) Q: What to do if delta_t between events is negative?
A: Do not include the event in a session, and output a warning message.
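One possible reading of the answer to question 2 is sketched below in Python. It assumes events are read in their given row order and that an event with a time earlier than the last accepted event is the one with a negative delta_t. The function name and the dict-based events are hypothetical, and the issue does not specify the exact warning text.

```python
import warnings


def drop_negative_deltas(events, time_stamp):
    """Split events into (kept, skipped) using the negative delta_t rule.

    Events are read in the order given. An event whose time is earlier
    than the last kept event has a negative delta_t, so it is skipped
    and a warning is issued.
    """
    kept, skipped = [], []
    for event in events:
        if kept and event[time_stamp] < kept[-1][time_stamp]:
            warnings.warn(f"negative delta_t, event skipped: {event!r}")
            skipped.append(event)
        else:
            kept.append(event)
    return kept, skipped
```

The kept events can then be assigned to sessions as usual, while the skipped ones stay available for inspection.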
References
[1] PDL Tools sessionization module
http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html
[2] PDL tools source code
https://github.com/pivotalsoftware/PDLTools
[3] Blog on bot signatures from Akamai
https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html
[4] Aster Analytics users guide, see "sessionize" function
http://www.info.teradata.com/edownload.cfm?itemid=143450001
http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
https://www.youtube.com/watch?v=C760M9ttK9Q
[5] General information on sessionization
https://en.wikipedia.org/wiki/Session_(web_analytics)
[6] See path function for partition and order by params
http://madlib.incubator.apache.org/docs/latest/group__grp__path.html
[7] SQL sessionization example from blog
https://blog.pivotal.io/pivotal/products/time-series-analysis-1-introduction-to-window-functions
[8] Postgres example of SQL based sessionization
http://randyzwitch.com/sessionizing-log-data-sql/
was:
Story
As a data scientist, I want to perform session reconstruction on my data set,
so that I can prepare for input into other algorithms like path functions, or
predictive analytics algorithms.
Details
1) The PDL Tools sessionization module [1] is one example
implementation. Source code is located at [2]. Also see [7].
2) How to sessionize. PDL Tools uses a time-based session reconstruction that
defines a session as a sequence of events by a particular user where no more
than n seconds have elapsed between successive events. That is, if more than
n seconds pass without an event from a user, the next event starts a new
session. The requirement for MADlib is similar but with the following addition:
* generalize partition expression
3) Proposed interface:
{code}
sessionize (
source_table,
output_table,
partition_expr,
time_stamp,
max_time)
{code}
where
output_table
Adds 2 new columns to the source_table, session_id and new_session:
* session_id = 1, 2, ..., n where n is the number of sessions in the partition
partition_expr
VARCHAR. The 'partition_expr' can be a single column or a list of
comma-separated columns/expressions to divide all rows into groups, or
partitions. Sessionization is applied across the rows that fall into the same
partition. This can be NULL or '' to indicate that it is to be applied to
the whole table.
time_stamp
Name of the column holding the time used for the sessionization calculation.
Can also be a PostgreSQL ORDER BY expression.
max_time
Maximum time delta between subsequent events within a session, i.e., the
session timeout.
Questions
1) Q: Do we need separate 'order_expr' and 'time_stamp' columns? Aster does
it this way.
A: No, we can't come up with a reason why a user would need this. If we want
to add it later, it can be an optional parameter.
2) Q: What to do if delta_t between events is negative?
A: Do not include the event in a session, and output a warning message.
References
[1] PDL Tools sessionization module
http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html
[2] PDL tools source code
https://github.com/pivotalsoftware/PDLTools
[3] Blog on bot signatures from Akamai
https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html
[4] Aster Analytics users guide, see "sessionize" function
http://www.info.teradata.com/edownload.cfm?itemid=143450001
http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
https://www.youtube.com/watch?v=C760M9ttK9Q
[5] General information on sessionization
https://en.wikipedia.org/wiki/Session_(web_analytics)
[6] See path function for partition and order by params
http://madlib.incubator.apache.org/docs/latest/group__grp__path.html
[7] SQL sessionization example from blog
https://blog.pivotal.io/pivotal/products/time-series-analysis-1-introduction-to-window-functions
[8] Postgres example of SQL based sessionization
http://randyzwitch.com/sessionizing-log-data-sql/
> Sessionization - Phase 1
> ------------------------
>
> Key: MADLIB-909
> URL: https://issues.apache.org/jira/browse/MADLIB-909
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Assignee: Nandish Jayaram
> Labels: gsoc2016, starter
> Fix For: v1.9.1