[ 
https://issues.apache.org/jira/browse/MADLIB-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-909:
-----------------------------------
    Description: 
Story

As a data scientist, I want to perform session reconstruction on my data set, 
so that I can prepare for input into other algorithms like path functions, or 
predictive analytics algorithms.

Details

1) The PDL Tools sessionization module [1] is one example implementation.
Source code is located at [2].

2) How to sessionize.  PDL Tools uses a time-based session reconstruction that 
defines a session as a sequence of events by a particular user where no more 
than n seconds have elapsed between successive events.  That is, if we don’t 
see an event from a user for n seconds, start a new session.  The requirement 
for MADlib is similar but with the following additions:

* generalize partition and order expressions
* add a minimum number of seconds that must elapse between events for the 
session to be valid. As stated in [3], black bots may send requests at 
excessive rates, so we want a way to be able to filter them out.
* add option to persist rows with NULL time stamps
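
As an illustration only, the time-out rule described above can be sketched in 
a few lines of Python. This is not the PDL Tools or MADlib implementation; the 
function name assign_sessions and the (user, timestamp) tuple layout are 
assumptions made for this example. A gap of exactly n seconds is treated as 
staying in the same session, following the "no more than n seconds" wording.

```python
from itertools import groupby
from operator import itemgetter


def assign_sessions(events, time_out):
    """Assign per-user session ids using a time-out rule.

    events   -- iterable of (user, timestamp_in_seconds) tuples (hypothetical layout)
    time_out -- a gap strictly greater than this starts a new session
    Returns a list of (user, timestamp, session_id), with ids starting at 1 per user.
    """
    result = []
    # Sorting by (user, timestamp) plays the role of PARTITION BY user ORDER BY time.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, rows in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        previous = None
        for _, ts in rows:
            # First event of a user, or a gap longer than time_out: new session.
            if previous is None or ts - previous > time_out:
                session_id += 1
            previous = ts
            result.append((user, ts, session_id))
    return result


events = [("a", 0), ("a", 100), ("a", 2000), ("b", 50), ("a", 2100)]
sessions = assign_sessions(events, time_out=1800)
```

Here user "a" gets two sessions because the gap between 100 and 2000 exceeds 
the 1800 second time-out, while user "b" has a single session.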

3) Proposed interface:

{code}
sessionize (
   source_table,
   output_table,
   partition_expr,
   order_expr,
   time_stamp,
   time_out, 
   min_time, 
   persist_nulls)
{code}
where

partition_expr
VARCHAR. The 'partition_expr' can be a single column or a list of 
comma-separated columns/expressions to divide all rows into groups, or 
partitions. Matching is applied across the rows that fall into the same 
partition. This can be NULL or '' to indicate the matching is to be applied to 
the whole table.

order_expr
VARCHAR. This expression controls the order in which rows are processed or 
matched in a partition. For example, time is a common way to order the rows 
within a partition.

time_stamp
Name of the column containing the time used for the sessionize calculation 
(often the same as order_expr, but not always).

time_out
FLOAT.  Maximum number of seconds between successive events within a session; 
a longer gap starts a new session.

min_time
FLOAT.  Minimum number of seconds that must elapse between clicks in order for 
this session to be considered a real (human) session.  Default is 0.

persist_nulls
BOOLEAN.  If TRUE, write the row to the output table even if time_stamp is NULL 
(and hence session id is NULL).  If FALSE, do not write the row to the output 
table if time_stamp is NULL.  Default is FALSE.

For an example of how min_time and persist_nulls could work, see Aster 
Analytics sessionization function [4].
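
To make the intended behavior of min_time and persist_nulls easier to discuss, 
here is a rough Python sketch. It rests on assumptions that this ticket does 
not settle: a session is flagged (not dropped) as invalid when any gap between 
its successive events is shorter than min_time, and rows with a NULL time 
stamp get a NULL session id and a NULL flag. The function body, the is_valid 
output field and the (key, time_stamp) row layout are all hypothetical.

```python
from collections import defaultdict


def sessionize(rows, time_out, min_time=0.0, persist_nulls=False):
    """Sketch of the proposed semantics; rows are (partition_key, time_stamp) pairs.

    time_stamp is a number of seconds, or None to model a SQL NULL.
    Returns a list of dicts with key, time_stamp, session_id and is_valid.
    """
    by_key = defaultdict(list)
    null_keys = []
    for key, ts in rows:
        if ts is None:
            null_keys.append(key)  # cannot be placed in any session
        else:
            by_key[key].append(ts)

    output = []
    for key in sorted(by_key):
        # Split the ordered time stamps into sessions using the time-out rule.
        sessions = []
        for ts in sorted(by_key[key]):
            if not sessions or ts - sessions[-1][-1] > time_out:
                sessions.append([])
            sessions[-1].append(ts)
        for session_id, stamps in enumerate(sessions, start=1):
            gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
            # Assumption: one gap shorter than min_time marks the whole session.
            is_valid = all(gap >= min_time for gap in gaps)
            for ts in stamps:
                output.append({"key": key, "time_stamp": ts,
                               "session_id": session_id, "is_valid": is_valid})

    if persist_nulls:
        # Assumption: NULL rows are kept with a NULL session id and NULL flag.
        for key in null_keys:
            output.append({"key": key, "time_stamp": None,
                           "session_id": None, "is_valid": None})
    return output


rows = [("human", 0.0), ("human", 30.0), ("bot", 0.0),
        ("bot", 0.1), ("bot", 0.2), ("human", None)]
with_nulls = sessionize(rows, time_out=1800, min_time=1.0, persist_nulls=True)
without_nulls = sessionize(rows, time_out=1800, min_time=1.0)
```

With these inputs the three "bot" events form one session flagged as invalid, 
the two timed "human" events form a valid session, and the NULL row appears 
only when persist_nulls is TRUE. Whether invalid sessions should be flagged or 
removed from the output is an open design choice.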

Acceptance

1) New test cases in install-check and TINC. TINC tests should include output 
validation tests (manually verified) and negative tests.
2) Updated documentation and online help functions (online help refers to the 
documentation that is accessible directly via SQL).
3) All tests should pass on Pulse.
4) Code should be independently reviewed and tested.

References

[1]  PDL Tools sessionization module
http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html

[2] PDL Tools source code
https://github.com/pivotalsoftware/PDLTools

[3] Blog on bot signatures from Akamai
https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html

[4] Aster Analytics users guide, see "sessionize" function
http://www.info.teradata.com/edownload.cfm?itemid=143450001
http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
https://www.youtube.com/watch?v=C760M9ttK9Q

[5] General information on sessionization
https://en.wikipedia.org/wiki/Session_(web_analytics)

[6] See path function for partition and order by params
http://madlib.incubator.apache.org/docs/latest/group__grp__path.html



> Sessionization
> --------------
>
>                 Key: MADLIB-909
>                 URL: https://issues.apache.org/jira/browse/MADLIB-909
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>              Labels: gsoc2016, starter
>             Fix For: v1.9.1
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
