[ 
https://issues.apache.org/jira/browse/MADLIB-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1001:
------------------------------------
    Description: 
Story

As a data scientist, I want to perform session reconstruction on my data set, 
so that I can prepare for input into other algorithms like path functions, or 
predictive analytics algorithms.

This is a follow-on to 
https://issues.apache.org/jira/browse/MADLIB-909
to add optional output controls.

Details 

Proposed interface changes:

params (optional)
TEXT, default: NULL. Parameters for sessionization in a comma-delimited string 
of key-value pairs. See the description below for details.

Parameters

Parameters in this section are supplied in the params argument as a string 
containing a comma-delimited list of key-value pairs. All of these named 
parameters are optional, and their order does not matter. Each value must be 
given in the form <param_name> = <value>; a parameter written in any other 
form is ignored.

{code}
'output_all_cols = <value>,
 create_view = <value>'
{code} 
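
As a rough illustration of the parsing rules above, here is a minimal Python 
sketch. The helper names parse_params and to_bool are hypothetical, and the 
accepted Boolean spellings are an assumption; the proposal does not specify 
either.

```python
def parse_params(params):
    """Parse a 'name = value, name = value' string into a dict.

    Entries that are not in <param_name> = <value> form are ignored.
    """
    result = {}
    if not params:  # NULL or empty string: no optional parameters given
        return result
    for entry in params.split(","):
        name, sep, value = entry.partition("=")
        name, value = name.strip().lower(), value.strip()
        if not sep or not name or not value:
            continue  # malformed entry, skip it
        result[name] = value
    return result


def to_bool(value, default):
    """Interpret a textual Boolean (assumed spellings); else use the default."""
    if value is None:
        return default
    text = value.strip().lower()
    if text in ("true", "t", "yes", "1"):
        return True
    if text in ("false", "f", "no", "0"):
        return False
    return default


# Order does not matter, and the malformed entry 'bogus' is dropped.
opts = parse_params("create_view = FALSE, output_all_cols = true, bogus")
output_all_cols = to_bool(opts.get("output_all_cols"), default=False)
create_view = to_bool(opts.get("create_view"), default=True)
print(output_all_cols, create_view)  # True False
```

Falling back to the documented defaults for missing or unrecognized values 
is one possible design choice; an implementation could also raise an error.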

Parameters

output_all_cols (Boolean)
        BOOLEAN default: FALSE. Controls which columns are output.  If FALSE, 
only the partition, time stamp and the generated session ID columns are output. 
 (The assumption is that the partition columns together with the time stamp 
column will be sufficient to perform a join with the input table.)  If TRUE, 
all columns from the source table are output in addition to the generated 
session ID.

create_view (Boolean)
        BOOLEAN default: TRUE. Determines whether to create a view or 
materialize a table as output. If you only need the session info once, 
creating a view could be significantly faster than materializing a table.
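
To make the effect of the two options concrete, the following Python sketch 
assembles the kind of SQL statement they might translate into. It is not 
the MADlib implementation: the function build_output_sql, the session_id 
column name, and the session_id_expr placeholder are all assumptions made 
for illustration.

```python
def build_output_sql(source_table, output_table, partition_cols, time_stamp,
                     session_id_expr, output_all_cols=False, create_view=True):
    """Return a CREATE VIEW / CREATE TABLE AS statement as a string.

    session_id_expr stands in for whatever SQL expression computes the
    session ID; it is a placeholder, not part of the proposal.
    """
    if output_all_cols:
        # TRUE: all source columns plus the generated session ID.
        select_list = "*"
    else:
        # FALSE: only the partition and time stamp columns plus the session ID.
        select_list = ", ".join(list(partition_cols) + [time_stamp])
    # create_view picks between a view and a materialized table.
    relation = "VIEW" if create_view else "TABLE"
    return "CREATE {0} {1} AS SELECT {2}, {3} AS session_id FROM {4}".format(
        relation, output_table, select_list, session_id_expr, source_table)


# Defaults: narrow column list, output created as a view.
print(build_output_sql("events", "events_sessions", ["user_id"],
                       "event_time", "<session_id_expr>"))
```

With the defaults, the output carries just enough columns to join back to 
the input table on the partition and time stamp columns.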

  was:
Story

As a data scientist, I want to perform session reconstruction on my data set, 
so that I can prepare for input into other algorithms like path functions, or 
predictive analytics algorithms.

Details

1)  The PDL Tools sessionization module [1] is one example 
implementation.  Source code is located at [2].

2) How to sessionize.  PDL Tools uses a time-based session reconstruction that 
defines a session as a sequence of events by a particular user in which no 
more than n seconds elapse between successive events.   That is, if we don’t 
see an event from a user for n seconds, we start a new session.   The 
requirement for MADlib is similar but with the following addition:
* generalize partition and order expressions
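
The time-based rule described in item 2 can be sketched in a few lines of 
Python. This is only an illustration of the rule, not the PDL Tools or 
MADlib code: the function name, the (user, timestamp) input shape, and the 
choice to number sessions from 1 within each user are assumptions.

```python
from collections import defaultdict


def sessionize(events, time_out):
    """Assign session IDs to (user, timestamp_in_seconds) events.

    A new session starts for a user whenever more than time_out seconds
    have elapsed since that user's previous event.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)  # partition the events by user

    result = []
    for user, stamps in by_user.items():
        stamps.sort()  # order each partition by time
        session_id = 0
        previous = None
        for ts in stamps:
            if previous is None or ts - previous > time_out:
                session_id += 1  # gap too long (or first event): new session
            result.append((user, ts, session_id))
            previous = ts
    return result


events = [("a", 0), ("a", 20), ("a", 100), ("b", 5), ("a", 110)]
for row in sessionize(events, time_out=30):
    print(row)
# ('a', 0, 1)
# ('a', 20, 1)
# ('a', 100, 2)
# ('a', 110, 2)
# ('b', 5, 1)
```

Here the user plays the role of the partition expression and the timestamp 
the role of the order expression; generalizing both is the addition asked 
for above.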

3) Proposed interface:

{code}
sessionize (
   source_table,
   output_table,
   partition_expr,
   order_expr,
   time_stamp,
   time_out)
{code}
where

partition_expr
VARCHAR. The 'partition_expr' can be a single column or a list of 
comma-separated columns/expressions to divide all rows into groups, or 
partitions. Matching is applied across the rows that fall into the same 
partition. This can be NULL or '' to indicate the matching is to be applied to 
the whole table.

order_expr
VARCHAR. This expression controls the order in which rows are processed or 
matched in a partition. For example, rows are commonly ordered by time.

time_stamp
Name of the column containing the time used for the sessionize calculation 
(often the same as order_expr, but not always).

time_out
FLOAT. Number of seconds allowed between successive events in a session; a 
longer gap starts a new session.

Acceptance

1) New test cases in install-check and TINC. TINC tests should include output 
validation tests (manually verified) and negative tests.
2) Updated documentation and online help functions (online help refers to the 
documentation that is accessible directly via SQL).
3) All tests should pass on Pulse.
4) Code should be independently reviewed and tested.

References

[1]  PDL Tools sessionization module
http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html

[2] PDL tools source code
https://github.com/pivotalsoftware/PDLTools

[3] Blog on bot signatures from Akamai
https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html

[4] Aster Analytics users guide, see "sessionize" function
http://www.info.teradata.com/edownload.cfm?itemid=143450001
http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
https://www.youtube.com/watch?v=C760M9ttK9Q

[5] General information on sessionization
https://en.wikipedia.org/wiki/Session_(web_analytics)

[6] See path function for partition and order by params
http://madlib.incubator.apache.org/docs/latest/group__grp__path.html




> Sessionization - Phase 2
> ------------------------
>
>                 Key: MADLIB-1001
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1001
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Nandish Jayaram
>              Labels: gsoc2016, starter
>             Fix For: v1.9.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
