[ https://issues.apache.org/jira/browse/MADLIB-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399790#comment-15399790 ]
Frank McQuillan commented on MADLIB-995:
----------------------------------------
This seems to work from my testing. For the example in the attachment:
{code:sql}
DROP TABLE IF EXISTS path_output, path_output_tuples;

SELECT madlib.path(
    'weblog',                      -- Name of the table
    'path_output',                 -- Table name to store the path results
    NULL,                          -- No partitions
    'event_timestamp ASC',         -- Time asc
    $$ FEMALE:=gender='Female',
       UNKNOWN:=gender='Unknown',
       MALE:=gender='Male'
    $$,                            -- Definition of various symbols used in the pattern definition
    '(UNKNOWN)(FEMALE)(UNKNOWN)',
    NULL,                          -- No agg
    TRUE,                          -- Persist matches
    TRUE                           -- overlapping patterns
);

SELECT * FROM path_output_tuples ORDER BY match_id, event_timestamp ASC;
{code}
produces:
{code}
   event_timestamp   | user_id | age_group | income_group | gender  | region  | household_size | click_event | purchase_event | revenue | margin | symbol  | match_id
---------------------+---------+-----------+--------------+---------+---------+----------------+-------------+----------------+---------+--------+---------+----------
 2012-04-15 07:02:00 |  100821 |         1 |            4 | Unknown | West    |              3 |           1 |              1 |     118 |     39 | UNKNOWN |        1
 2012-04-15 08:51:00 |  102201 |         3 |            3 | Female  | East    |              3 |           0 |              0 |       0 |      0 | FEMALE  |        1
 2012-04-15 09:28:00 |  101121 |         2 |            2 | Unknown | West    |              4 |           1 |              1 |     103 |     32 | UNKNOWN |        1
 2012-04-15 09:28:00 |  101121 |         2 |            2 | Unknown | West    |              4 |           1 |              1 |     103 |     32 | UNKNOWN |        2
 2012-04-15 10:19:00 |  103711 |         4 |            3 | Female  | Central |              5 |           0 |              0 |       0 |      0 | FEMALE  |        2
 2012-04-15 11:40:00 |  100821 |         1 |            4 | Unknown | West    |              3 |           0 |              0 |       0 |      0 | UNKNOWN |        2
 2012-04-16 02:12:00 |  100821 |         1 |            4 | Unknown | West    |              3 |           1 |              1 |     153 |     26 | UNKNOWN |        3
 2012-04-16 04:20:00 |  102201 |         3 |            3 | Female  | East    |              3 |           0 |              0 |       0 |      0 | FEMALE  |        3
 2012-04-16 05:38:00 |  101121 |         2 |            2 | Unknown | West    |              4 |           1 |              0 |       0 |      0 | UNKNOWN |        3
 2012-04-16 20:46:00 |  101121 |         2 |            2 | Unknown | West    |              4 |           1 |              1 |     131 |     28 | UNKNOWN |        4
 2012-04-16 21:11:00 |  101331 |         2 |            4 | Female  | East    |              5 |           1 |              1 |     127 |     27 | FEMALE  |        4
 2012-04-16 22:35:00 |  101121 |         2 |            2 | Unknown | West    |              4 |           0 |              0 |       0 |      0 | UNKNOWN |        4
(12 rows)
{code}
as expected.
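Note that path_output_tuples lists the shared 09:28 row twice, once per match, whereas the acceptance table in the issue lists it once with match ID "1,2". A minimal Python sketch of collapsing the per-match rows into that form follows. The helper name collapse_match_ids is hypothetical, and the sketch assumes event_timestamp uniquely identifies a row, which holds for the rows shown here but may not hold in general.

```python
from collections import defaultdict

# (event_timestamp, match_id) pairs copied from the first six rows of the
# path_output_tuples result shown above.
tuples = [
    ("2012-04-15 07:02:00", 1),
    ("2012-04-15 08:51:00", 1),
    ("2012-04-15 09:28:00", 1),
    ("2012-04-15 09:28:00", 2),
    ("2012-04-15 10:19:00", 2),
    ("2012-04-15 11:40:00", 2),
]


def collapse_match_ids(rows):
    """Group match ids per row key and join them as '1,2' style strings.

    Hypothetical helper: assumes the first tuple element uniquely
    identifies a source row.
    """
    by_row = defaultdict(list)
    for timestamp, match_id in rows:
        by_row[timestamp].append(match_id)
    return {
        timestamp: ",".join(str(m) for m in sorted(ids))
        for timestamp, ids in by_row.items()
    }


collapsed = collapse_match_ids(tuples)
# Rows that belong to more than one match, i.e. where matches overlap.
shared = [ts for ts, ids in collapsed.items() if "," in ids]
print(shared)
```

For these six tuples the only shared row is the 09:28 one, which is where the 1st and 2nd matches overlap.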
> Path - overlapping partitions
> -----------------------------
>
> Key: MADLIB-995
> URL: https://issues.apache.org/jira/browse/MADLIB-995
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Fix For: v1.9.1
>
> Attachments: Ecommerce data set for path test 3.csv,
> path-overlapping-patterns.ipynb
>
>
> Story
> As a data scientist, I want to be able to define multiple symbols that result
> in overlapping partitions.
> See
> http://madlib.incubator.apache.org/docs/latest/group__grp__path.html
> for a description of what a symbol is.
> Currently in 1.9, overlapping partitions are not supported. The default is
> non-overlapping, where the path algo begins the next pattern search at the
> row that follows the last pattern match (like how grep works in UNIX).
> In the case of overlapping, the path algo needs to find every occurrence of
> the pattern in the partition, regardless of whether it might have been part
> of a previously found match. This means one row can match multiple symbols in
> a given matched pattern so there is a dependency on
> https://issues.apache.org/jira/browse/MADLIB-943 . There is a (small) chance
> that this story is a no-op once
> https://issues.apache.org/jira/browse/MADLIB-943 is done.
> Need to add a new optional BOOLEAN parameter to the interface called
> "overlapping_patterns". Default is FALSE.
> (While you are at it please fix the docs to indicate that the "persist_rows"
> param is optional with default FALSE.)
> Acceptance
> The attached data set and query should produce the following output:
> Event Timestamp | User ID | Age Group | Income Group | Gender  | Region  | Household Size | Click Event | Purchase Event | Revenue | Margin | Match ID
> 4/15/12 7:02    | 100821  | 1         | 4            | Unknown | West    | 3              | 1           | 1              | 118     | 39     | 1
> 4/15/12 8:51    | 102201  | 3         | 3            | Female  | East    | 3              | 0           | 0              | 0       | 0      | 1
> 4/15/12 9:28    | 101121  | 2         | 2            | Unknown | West    | 4              | 1           | 1              | 103     | 32     | 1,2
> 4/15/12 10:19   | 103711  | 4         | 3            | Female  | Central | 5              | 0           | 0              | 0       | 0      | 2
> 4/15/12 11:40   | 100821  | 1         | 4            | Unknown | West    | 3              | 0           | 0              | 0       | 0      | 2
> 4/16/12 2:12    | 100821  | 1         | 4            | Unknown | West    | 3              | 1           | 1              | 153     | 26     | 3
> 4/16/12 4:20    | 102201  | 3         | 3            | Female  | East    | 3              | 0           | 0              | 0       | 0      | 3
> 4/16/12 5:38    | 101121  | 2         | 2            | Unknown | West    | 4              | 1           | 0              | 0       | 0      | 3
> 4/16/12 20:46   | 101121  | 2         | 2            | Unknown | West    | 4              | 1           | 1              | 131     | 28     | 4
> 4/16/12 21:11   | 101331  | 2         | 4            | Female  | East    | 5              | 1           | 1              | 127     | 27     | 4
> 4/16/12 22:35   | 101121  | 2         | 2            | Unknown | West    | 4              | 0           | 0              | 0       | 0      | 4
> There are 4 pattern matches. The 1st and the 2nd overlap.
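The difference between the two search modes described in the issue can be illustrated with a small Python sketch. This is not MADlib's implementation: find_matches is a hypothetical helper that handles only a fixed sequence of symbols (no regular-expression operators) and assumes each row has already been mapped to exactly one symbol. The symbol sequence is taken from the five rows that make up the 1st and 2nd matches.

```python
def find_matches(symbols, pattern, overlapping=False):
    """Return the start index of every occurrence of pattern in symbols.

    Hypothetical helper for illustration only. With overlapping=False the
    search resumes at the row after the last matched row (grep-like); with
    overlapping=True it resumes at the row after the match start, so rows
    from an earlier match can be reused.
    """
    starts = []
    i = 0
    while i + len(pattern) <= len(symbols):
        if symbols[i:i + len(pattern)] == pattern:
            starts.append(i)
            i += 1 if overlapping else len(pattern)
        else:
            i += 1
    return starts


# Symbols for the rows at 07:02, 08:51, 09:28, 10:19 and 11:40 on 4/15/12.
symbols = ["UNKNOWN", "FEMALE", "UNKNOWN", "FEMALE", "UNKNOWN"]
pattern = ["UNKNOWN", "FEMALE", "UNKNOWN"]

non_overlapping = find_matches(symbols, pattern)
overlapping = find_matches(symbols, pattern, overlapping=True)
print(non_overlapping)  # [0]
print(overlapping)      # [0, 2]
```

In non-overlapping mode only the match starting at 07:02 is found over these five rows, because the 09:28 row is consumed by it. In overlapping mode the 09:28 row also starts a second match, which corresponds to match IDs 1 and 2 in the acceptance table.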