[ 
https://issues.apache.org/jira/browse/MADLIB-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313366#comment-15313366
 ] 

Nandish Jayaram commented on MADLIB-909:
----------------------------------------

I have made the necessary changes. Awaiting more comments on the pull request:
https://github.com/apache/incubator-madlib/pull/44/files

> Sessionization - Phase 1
> ------------------------
>
>                 Key: MADLIB-909
>                 URL: https://issues.apache.org/jira/browse/MADLIB-909
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Nandish Jayaram
>              Labels: gsoc2016, starter
>             Fix For: v1.9.1
>
>
> Story
> As a data scientist, I want to perform session reconstruction on my data set, 
> so that I can prepare for input into other algorithms like path functions, or 
> predictive analytics algorithms.
> Details
> 1)  The PDL Tools sessionization module [1] is one example 
> implementation.  Source code is located at [2].
> 2) How to sessionize.  PDL Tools uses a time-based session reconstruction 
> that defines a session as a sequence of events by a particular user where no 
> more than n seconds have elapsed between successive events.   That is, if we 
> don’t see an event from a user for n seconds, start a new session.   The 
> requirement for MADlib is similar but with the following addition:
> * generalize partition and order expressions
> 3) Proposed interface:
> {code}
> sessionize (
>    source_table,
>    output_table,
>    partition_expr,
>    order_expr,
>    time_stamp,
>    time_out)
> {code}
> where
> partition_expr
> VARCHAR. The 'partition_expr' can be a single column or a list of 
> comma-separated columns/expressions to divide all rows into groups, or 
> partitions. Matching is applied across the rows that fall into the same 
> partition. This can be NULL or '' to indicate the matching is to be applied 
> to the whole table.
> order_expr
> VARCHAR. This expression controls the order in which rows are processed or 
> matched in a partition. For example, time is a common way to order partitions.
> time_stamp
> Column name with the time used for the sessionize calculation (often the same 
> as order_expr, but not always).
> time_out
> Number of seconds between subsequent events that defines a session.  Same 
> units as time_stamp.
> Acceptance
> 1) New test cases in install-check and TINC. TINC tests should include output 
> validation tests (manually verified) and negative tests.
> 2) Updated documentation and online help functions (online help refers to the 
> documentation that is accessible directly via SQL).
> 3) All tests should pass on Pulse.
> 4) Code should be independently reviewed and tested.
> References
> [1]  PDL Tools sessionization module
> http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html
> [2] PDL tools source code
> https://github.com/pivotalsoftware/PDLTools
> [3] Blog on bot signatures from Akamai
> https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html
> [4] Aster Analytics users guide, see "sessionize" function
> http://www.info.teradata.com/edownload.cfm?itemid=143450001
> http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
> https://www.youtube.com/watch?v=C760M9ttK9Q
> [5] General information on sessionization
> https://en.wikipedia.org/wiki/Session_(web_analytics)
> [6] See path function for partition and order by params
> http://madlib.incubator.apache.org/docs/latest/group__grp__path.html
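
The time-based rule in item 2 of Details can be sketched in Python as follows. This is an illustration only, not the MADlib or PDL Tools implementation. The function signature, the row layout (a list of dicts), and the keys user_id and event_time are hypothetical. The sketch assumes rows are ordered by the time stamp itself (order_expr equal to time_stamp) and that a gap exactly equal to time_out stays in the same session.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(rows, partition_key, time_stamp, time_out):
    """Return copies of rows with a per-partition session_id added.

    rows          -- list of dicts (hypothetical stand-in for source_table)
    partition_key -- dict key playing the role of partition_expr
    time_stamp    -- dict key holding a datetime; also used for ordering
    time_out      -- timedelta; a larger gap starts a new session
    """
    # Group rows by partition, e.g. one group per user.
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)

    out = []
    for group in groups.values():
        # Assumption: order_expr is the time stamp column.
        group.sort(key=lambda r: r[time_stamp])
        session_id = 0
        previous = None
        for row in group:
            # Start a new session on the first row of a partition, or when
            # more than time_out has elapsed since the previous event.
            if previous is None or row[time_stamp] - previous > time_out:
                session_id += 1
            out.append({**row, "session_id": session_id})
            previous = row[time_stamp]
    return out


t0 = datetime(2016, 6, 1, 12, 0, 0)
events = [
    {"user_id": "a", "event_time": t0},
    {"user_id": "a", "event_time": t0 + timedelta(seconds=30)},
    {"user_id": "a", "event_time": t0 + timedelta(seconds=200)},
    {"user_id": "b", "event_time": t0 + timedelta(seconds=10)},
]
result = sessionize(events, "user_id", "event_time", timedelta(seconds=60))
```

With a 60-second time_out, the third event for user "a" arrives 170 seconds after the second and so opens a second session, while user "b" is numbered independently. Whether a gap of exactly time_out should split a session is not stated in the issue; the sketch keeps it in the same session.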



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
