[
https://issues.apache.org/jira/browse/MADLIB-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308064#comment-15308064
]
ASF GitHub Bot commented on MADLIB-909:
---------------------------------------
GitHub user njayaram2 opened a pull request:
https://github.com/apache/incubator-madlib/pull/44
Feature: Sessionize function
JIRA: MADLIB-909
This contains the implementation of the phase 1 sessionize function.
The current input parameters are the input and output table names,
along with a partition expression, the time_stamp column name and
the time_out period to consider for sessionization. The implementation
uses a window function to perform sessionization.
This commit also contains the install check file for sessionization.
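As a rough illustration of the time-based rule described in the issue (start a new session whenever more than time_out seconds separate successive events within a partition), here is a minimal Python sketch. It is not the MADlib implementation, which per the description above runs in SQL with a window function. The function name, the tuple layout, and the choice to number sessions from 1 within each partition are all assumptions made for this example.

```python
from collections import defaultdict


def sessionize(events, time_out):
    """Assign session ids to (partition_key, timestamp_seconds) events.

    Hypothetical helper for illustration only. A new session starts when
    the gap to the previous event in the same partition exceeds time_out.
    Session ids restart at 1 in every partition (an assumption).
    """
    # Group timestamps by partition key, mirroring a PARTITION BY clause.
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)

    result = []
    for key, stamps in by_key.items():
        session_id = 0
        prev = None
        # Order events by time within the partition, mirroring ORDER BY.
        for ts in sorted(stamps):
            if prev is None or ts - prev > time_out:
                session_id += 1
            prev = ts
            result.append((key, ts, session_id))
    return result


events = [("a", 0), ("a", 50), ("a", 200), ("b", 10), ("a", 260)]
for row in sessionize(events, time_out=100):
    print(row)
```

In SQL the same comparison would typically be made against the previous row's timestamp within each partition, which is where a window function fits.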
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/njayaram2/incubator-madlib feature/sessionization
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-madlib/pull/44.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #44
----
commit b0d3b318402d62a7c1aab300f183ea049afb538b
Author: Nandish Jayaram <[email protected]>
Date: 2016-05-31T16:30:44Z
Feature: Sessionize function
JIRA: MADLIB-909
This contains the implementation of the phase 1 sessionize function.
The current input parameters are the input and output table names,
along with a partition expression, the time_stamp column name and
the time_out period to consider for sessionization. The implementation
uses a window function to perform sessionization.
This commit also contains the install check file for sessionization.
----
> Sessionization - Phase 1
> ------------------------
>
> Key: MADLIB-909
> URL: https://issues.apache.org/jira/browse/MADLIB-909
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Labels: gsoc2016, starter
> Fix For: v1.9.1
>
>
> Story
> As a data scientist, I want to perform session reconstruction on my data set,
> so that I can prepare for input into other algorithms like path functions, or
> predictive analytics algorithms.
> Details
> 1) The PDL Tools sessionization module [1] is one example
> implementation. Source code is located at [2].
> 2) How to sessionize. PDL Tools uses a time-based session reconstruction
> that defines a session as a sequence of events by a particular user where no
> more than n seconds have elapsed between successive events. That is, if we
> don’t see an event from a user for n seconds, start a new session. The
> requirement for MADlib is similar but with the following addition:
> * generalize partition and order expressions
> 3) Proposed interface:
> {code}
> sessionize (
> source_table,
> output_table,
> partition_expr,
> order_expr,
> time_stamp,
> time_out)
> {code}
> where
> partition_expr
> VARCHAR. The 'partition_expr' can be a single column or a list of
> comma-separated columns/expressions to divide all rows into groups, or
> partitions. Matching is applied across the rows that fall into the same
> partition. This can be NULL or '' to indicate the matching is to be applied
> to the whole table.
> order_expr
> VARCHAR. This expression controls the order in which rows are processed or
> matched in a partition. For example, time is a common way to order rows
> within a partition.
> time_stamp
> Column name with time used for the sessionize calculation (often the same
> as order_expr, but not always).
> time_out
> FLOAT. Maximum number of seconds allowed between successive events within
> a session.
> Acceptance
> 1) New test cases in install-check and TINC. TINC tests should include output
> validation tests (manually verified) and negative tests.
> 2) Updated documentation and online help functions (online help refers to the
> documentation that is accessible directly via SQL).
> 3) All tests should pass on Pulse.
> 4) Code should be independently reviewed and tested.
> References
> [1] PDL Tools sessionization module
> http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html
> [2] PDL tools source code
> https://github.com/pivotalsoftware/PDLTools
> [3] Blog on bot signatures from Akamai
> https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html
> [4] Aster Analytics users guide, see "sessionize" function
> http://www.info.teradata.com/edownload.cfm?itemid=143450001
> http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
> https://www.youtube.com/watch?v=C760M9ttK9Q
> [5] General information on sessionization
> https://en.wikipedia.org/wiki/Session_(web_analytics)
> [6] See path function for partition and order by params
> http://madlib.incubator.apache.org/docs/latest/group__grp__path.html