[ 
https://issues.apache.org/jira/browse/MADLIB-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-995:
-----------------------------------
    Description: 
Story

As a data scientist, I want to be able to define multiple symbols that result 
in overlapping partitions.

See
http://madlib.incubator.apache.org/docs/latest/group__grp__path.html
for a description of what a symbol is.

Currently in 1.9, overlapping partitions are not supported. The default is 
non-overlapping, where the path algo begins the next pattern search at the row 
that follows the last pattern match (like how grep works in UNIX).

In the case of overlapping, the path algo needs to find every occurrence of the 
pattern in the partition, regardless of whether it might have been part of a 
previously found match. This means one row can match multiple symbols in a 
given matched pattern, so there is a dependency on 
https://issues.apache.org/jira/browse/MADLIB-943. There is a (small) chance 
that this story is a no-op once 
https://issues.apache.org/jira/browse/MADLIB-943 is done.
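
To illustrate the difference between the two search modes described above, 
here is a minimal Python sketch. It is not MADlib code: it assumes each row has 
already been reduced to a one-character symbol, and the helper name 
find_matches is hypothetical. Non-overlapping search resumes after the end of 
the previous match (the grep-like behavior), while the overlapping search 
tries every start position by using a zero-width lookahead.

```python
import re


def find_matches(symbols: str, pattern: str, overlapping: bool = False):
    """Return (start, end) index pairs for pattern matches in a symbol string.

    Hypothetical helper for illustration only; not part of MADlib.
    """
    if not overlapping:
        # re.finditer resumes scanning after the end of the previous match,
        # so a row consumed by one match cannot start or join another.
        return [m.span() for m in re.finditer(pattern, symbols)]
    # A zero-width lookahead consumes no characters, so the scan advances one
    # position at a time and reports a match at every start position.
    lookahead = re.compile(f"(?=({pattern}))")
    return [m.span(1) for m in lookahead.finditer(symbols)]


symbols = "ABABA"
print(find_matches(symbols, "ABA"))                    # [(0, 3)]
print(find_matches(symbols, "ABA", overlapping=True))  # [(0, 3), (2, 5)]
```

In the overlapping case the symbol at index 2 belongs to both matches, which 
is the situation the new "overlapping_patterns" parameter is meant to allow.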

Need to add a new optional BOOLEAN parameter to the interface called 
"overlapping_patterns".  Default is FALSE.

(While you are at it please fix the docs to indicate that the "persist_rows" 
param is optional with default FALSE.)

Acceptance

The attached data set and query should produce the following output:

Event Timestamp  User ID  Age Group  Income Group  Gender   Region   Household Size  Click Event  Purchase Event  Revenue  Margin  Match ID
4/15/12 7:02     100821   1          4             Unknown  West     3               1            1               118      39      1
4/15/12 8:51     102201   3          3             Female   East     3               0            0               0        0       1
4/15/12 9:28     101121   2          2             Unknown  West     4               1            1               103      32      1,2
4/15/12 10:19    103711   4          3             Female   Central  5               0            0               0        0       2
4/15/12 11:40    100821   1          4             Unknown  West     3               0            0               0        0       2
4/16/12 2:12     100821   1          4             Unknown  West     3               1            1               153      26      3
4/16/12 4:20     102201   3          3             Female   East     3               0            0               0        0       3
4/16/12 5:38     101121   2          2             Unknown  West     4               1            0               0        0       3
4/16/12 20:46    101121   2          2             Unknown  West     4               1            1               131      28      4
4/16/12 21:11    101331   2          4             Female   East     5               1            1               127      27      4
4/16/12 22:35    101121   2          2             Unknown  West     4               0            0               0        0       4

There are 4 pattern matches.  The 1st and the 2nd overlap.
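
The Match ID column above can be reproduced with a small Python sketch. The 
actual query is attached to the issue and is not shown here, so the pattern 
used below is an assumption inferred from the output: a row with Purchase 
Event = 1 followed by any two rows, with all rows treated as a single 
partition ordered by Event Timestamp. The function name assign_match_ids and 
its parameters are hypothetical.

```python
def assign_match_ids(purchase_flags, window=3, overlapping=True):
    """Return, for each row, the list of match IDs that the row belongs to.

    Assumed pattern (not taken from the attached query): a purchase row
    followed by any (window - 1) rows.
    """
    ids = [[] for _ in purchase_flags]
    match_id = 0
    i = 0
    while i + window <= len(purchase_flags):
        if purchase_flags[i] == 1:
            match_id += 1
            for j in range(i, i + window):
                ids[j].append(match_id)
            # Overlapping: resume at the very next row.
            # Non-overlapping: resume at the row after the match ends.
            i += 1 if overlapping else window
        else:
            i += 1
    return ids


# Purchase Event column from the expected output, in timestamp order.
purchase_flags = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0]

for row_ids in assign_match_ids(purchase_flags):
    print(",".join(str(m) for m in row_ids))
```

Under this assumption the third row prints "1,2", matching the table, and 
calling assign_match_ids(purchase_flags, overlapping=False) yields only 3 
matches because the third row is consumed by the first match.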




  was:
Story

As a data scientist, I want to be able to define multiple symbols that result 
in overlapping partitions.

See
http://madlib.incubator.apache.org/docs/latest/group__grp__path.html
for a description of what a symbol is.

Currently in 1.9, overlapping partitions are not supported. The default is 
non-overlapping, where the path algo begins the next pattern search at the row 
that follows the last pattern match (like how grep works in UNIX).

In the case of overlapping, the path algo needs to find every occurrence of the 
pattern in the partition, regardless of whether it might have been part of a 
previously found match. This means one row can match multiple symbols in a 
given matched pattern so there is a dependency on 
https://issues.apache.org/jira/browse/MADLIB-943 .  There is (small) chance 
that this story is a no-op once 
https://issues.apache.org/jira/browse/MADLIB-943 is done.

Need to add a new optional BOOLEAN parameter to the interface called 
"overlapping_patterns".  Default is FALSE.

(While you are at it please fix the docs to indicate that the "persist_rows" 
param is optional with default FALSE.)

Acceptance




> Path - overlapping partitions
> -----------------------------
>
>                 Key: MADLIB-995
>                 URL: https://issues.apache.org/jira/browse/MADLIB-995
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>             Fix For: v1.9.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
