GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/7937

    [SPARK-9540] [MLLIB] optimize PrefixSpan implementation

    This is a major refactoring of the PrefixSpan implementation. It contains 
the following changes:
    
    1. Expand the prefix one item at a time. The existing implementation 
generates all subsets of each itemset, which might have scalability issues when 
the itemset is large.
    2. Use a new internal format. `<(12)(31)>` is represented by 
`[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` as the delimiter because negative 
numbers are used to indicate partial prefix items, e.g., `_2` is represented by `-2`.
    3. Remember the start indices of all partial projections in the projected 
postfix to help the next projection.
    4. Reuse the original sequence array for projected postfixes.
    5. Use `Prefix` IDs in aggregation rather than the prefixes' content.
    6. Use `ArrayBuilder` for building primitive arrays.
    7. Expose `maxLocalProjDBSize`.
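
The internal format from item 2 can be illustrated with a minimal sketch. This is not the code in the PR (which is Scala); the function names `encode_sequence` and `decode_sequence` are hypothetical, and sorting the items within each itemset is an assumption inferred from the `<(12)(31)>` example, where `(31)` is stored as `1, 3`. Item IDs are assumed to be positive, since `0` is the delimiter and negative numbers are reserved for partial prefix items.

```python
def encode_sequence(itemsets):
    """Flatten a sequence of itemsets into a 0-delimited list of item IDs.

    Assumption: item IDs are positive integers, because 0 is the delimiter
    and negative values are reserved for partial prefix items.
    """
    encoded = [0]  # the encoding starts with a delimiter
    for itemset in itemsets:
        if not itemset or any(item <= 0 for item in itemset):
            raise ValueError("itemsets must be non-empty with positive item IDs")
        # Assumption: items inside an itemset are stored sorted and de-duplicated.
        encoded.extend(sorted(set(itemset)))
        encoded.append(0)  # close the itemset with a delimiter
    return encoded


def decode_sequence(encoded):
    """Recover the list of itemsets from the 0-delimited form."""
    itemsets, current = [], []
    for value in encoded[1:]:  # skip the leading delimiter
        if value == 0:
            itemsets.append(current)
            current = []
        else:
            current.append(value)
    return itemsets


# <(12)(31)> from the description above
print(encode_sequence([[1, 2], [3, 1]]))  # [0, 1, 2, 0, 1, 3, 0]
```

Because every itemset is bracketed by `0`, itemset boundaries can be found in a single flat primitive array without nested collections.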
    
    `Postfix`'s API doc should be a good place to start.
    
    @feynmanliang @zhangjiajin 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-9540

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7937.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7937
    
----
commit 8afc86abb668a9cee22a72eeb3b217dff3256640
Author: Xiangrui Meng <[email protected]>
Date:   2015-08-04T09:54:35Z

    refactor impl

commit bd0bd51e92aad89aefd436cb0654e43a618ff8c8
Author: Xiangrui Meng <[email protected]>
Date:   2015-08-04T17:03:37Z

    naming and documentation

----


