GitHub user mengxr opened a pull request:
https://github.com/apache/spark/pull/7937
[SPARK-9540] [MLLIB] optimize PrefixSpan implementation
This is a major refactoring of the PrefixSpan implementation. It contains
the following changes:
1. Expand the prefix with one item at a time. The existing implementation
generates all subsets for each itemset, which might cause scalability issues
when the itemset is large.
2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0,
1, 3, 0]` internally. We use `0` as the itemset delimiter because negative
numbers are used to indicate partial prefix items, e.g., `_2` is represented
by `-2`.
3. Remember the start indices of all partial projections in the projected
postfix to help the next projection.
4. Reuse the original sequence array for projected postfixes.
5. Use `Prefix` IDs in aggregation rather than the prefixes' contents.
6. Use `ArrayBuilder` for building primitive arrays.
7. Expose `maxLocalProjDBSize`.
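
As a rough illustration of the internal format described in change 2, the sketch below flattens a sequence of itemsets into a `0`-delimited array and back. It is written in Python for brevity and is not the code in this pull request; the names `encode` and `decode` are hypothetical, and sorting the items within each itemset is inferred from the `<(12)(31)>` example. Negative partial-prefix markers such as `-2` are not handled here.

```python
def encode(sequence):
    """Flatten a list of itemsets into a 0-delimited list of positive ints.

    Hypothetical helper: a leading 0, then each itemset's items in sorted
    order, with a 0 after every itemset.
    """
    out = [0]
    for itemset in sequence:
        if any(item <= 0 for item in itemset):
            raise ValueError("items must be positive; 0 is the delimiter")
        out.extend(sorted(itemset))
        out.append(0)
    return out


def decode(encoded):
    """Invert encode(): split the flat list back into itemsets at each 0."""
    if not encoded or encoded[0] != 0 or encoded[-1] != 0:
        raise ValueError("encoded sequence must start and end with 0")
    sequence, current = [], []
    for x in encoded[1:]:
        if x == 0:
            sequence.append(current)
            current = []
        else:
            current.append(x)
    return sequence


# The example from the description: <(12)(31)>
print(encode([[1, 2], [3, 1]]))          # [0, 1, 2, 0, 1, 3, 0]
print(decode([0, 1, 2, 0, 1, 3, 0]))     # [[1, 2], [1, 3]]
```

Reserving `0` as the delimiter leaves the sign of each entry free to carry meaning, which is what allows a negative value to mark a partial prefix item.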
`Postfix`'s API doc should be a good place to start.
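
To make the scalability concern in change 1 concrete, the sketch below compares how many candidates a single itemset yields when every non-empty subset is enumerated versus when only single items are considered. It is a Python illustration of the growth rates only, not the algorithm in this pull request; both function names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(itemset):
    """Enumerate every non-empty subset of an itemset (2^n - 1 of them)."""
    items = sorted(itemset)
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


def single_item_candidates(itemset):
    """Enumerate the items one at a time (n candidates)."""
    return sorted(itemset)


for n in (3, 10, 20):
    itemset = range(1, n + 1)
    print(n, len(nonempty_subsets(itemset)), len(single_item_candidates(itemset)))
```

An itemset of 20 items already has 1,048,575 non-empty subsets, whereas expanding by one item per step looks at 20 candidates per itemset and reaches longer patterns over successive steps.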
@feynmanliang @zhangjiajin
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mengxr/spark SPARK-9540
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7937.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7937
----
commit 8afc86abb668a9cee22a72eeb3b217dff3256640
Author: Xiangrui Meng <[email protected]>
Date: 2015-08-04T09:54:35Z
refactor impl
commit bd0bd51e92aad89aefd436cb0654e43a618ff8c8
Author: Xiangrui Meng <[email protected]>
Date: 2015-08-04T17:03:37Z
naming and documentation
----