[GitHub] spark pull request: [SPARK-5927][MLlib] Modify FPGrowth's partitio...

viirya Sat, 21 Feb 2015 02:09:07 -0800

Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4706#issuecomment-75365051
  
    Currently, because we use HashPartitioner on the item rankings to decide 
the partition, items are spread very evenly across partitions. When the number 
of partitions is smaller than the number of items, this distribution is 
inefficient.
    
    For example, assume the following transactions, which contain 9 frequent 
items (ranked 0 through 8 below):
    
          "r z h k p a b d e"
          "z y x w v u t s a b c"
          "s x o n r a c i"
          "x z y m t s q e b c"
          "z a u"
          "x z y r q t p a n m"
    
    With 2 partitions, the current implementation generates the following 
partitions:
    
        Map(1 -> List(0, 2, 3, 4, 6, 7), 0 -> List(0, 2, 3, 4, 6, 7, 8))
        Map(1 -> List(0, 1, 4, 5), 0 -> List(0, 1, 4))
        Map(1 -> List(0, 1, 2, 3, 4, 6, 7), 0 -> List(0, 1, 2, 3, 4, 6, 7, 8))
        Map(1 -> List(0, 1), 0 -> List(0))
        Map(1 -> List(1, 2, 5), 0 -> List(1, 2, 5, 6, 8))
        Map(1 -> List(0, 1, 2, 3, 5, 7), 0 -> List(0, 1, 2))
    
    For the transactions `List(0, 1, 2, 3, 4, 6, 7, 8)` and `List(0, 2, 3, 4, 6, 
7, 8)`, the two partitions receive almost the same items.
    
    With this PR:
    
        Map(1 -> List(0, 1, 4, 5), 0 -> List(0, 1, 4))
        Map(1 -> List(0, 2, 3, 4, 6, 7, 8), 0 -> List(0, 2, 3, 4))
        Map(1 -> List(0, 1, 2, 3, 4, 6, 7, 8), 0 -> List(0, 1, 2, 3, 4))
        Map(0 -> List(0, 1))
        Map(1 -> List(0, 1, 2, 3, 5, 7), 0 -> List(0, 1, 2, 3))
        Map(1 -> List(1, 2, 5, 6, 8), 0 -> List(1, 2))
    
    Now `List(0, 1, 2, 3, 4, 6, 7, 8)` generates `List(0, 1, 2, 3, 4, 6, 7, 8)` 
for partition 1 and `List(0, 1, 2, 3, 4)` for partition 0. Likewise, 
`List(0, 2, 3, 4, 6, 7, 8)` generates `List(0, 2, 3, 4, 6, 7, 8)` for 
partition 1 and `List(0, 2, 3, 4)` for partition 0.
    
    Partition 0 therefore has fewer items to build its prefix tree from, while 
partition 1 builds its tree from the same items as before.
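    
    To make the comparison concrete, here is a minimal, Spark-free sketch of 
how the two schemes split a transaction. It is an illustration, not the code in 
this PR. The names `CondTransactionsSketch`, `byHash`, and `byRange` are 
hypothetical. It assumes that hashing a non-negative Int rank reduces to 
`rank % numPartitions`, and that the new scheme assigns contiguous blocks of 
`ceil(numItems / numPartitions)` ranks to each partition. The block size is 
inferred from the lists above and may not match the PR exactly.

```scala
object CondTransactionsSketch {
  // Split one transaction (a list of item ranks) into per-partition prefixes.
  // Scanning from the highest rank down, each partition keeps the longest
  // prefix that ends in an item assigned to that partition.
  def genCondTransactions(
      ranks: Seq[Int],
      getPartition: Int => Int): Map[Int, List[Int]] = {
    val sorted = ranks.sorted.toList
    var output = Map.empty[Int, List[Int]]
    var i = sorted.length - 1
    while (i >= 0) {
      val part = getPartition(sorted(i))
      if (!output.contains(part)) {
        output += (part -> sorted.take(i + 1))
      }
      i -= 1
    }
    output
  }

  val numItems = 9
  val numPartitions = 2

  // Assumption: hash partitioning of a non-negative Int rank is a modulo.
  val byHash: Int => Int = rank => rank % numPartitions

  // Assumption: each partition owns a contiguous block of ranks.
  val blockSize: Int = (numItems + numPartitions - 1) / numPartitions
  val byRange: Int => Int = rank => rank / blockSize

  def main(args: Array[String]): Unit = {
    val transaction = List(0, 1, 2, 3, 4, 6, 7, 8)
    println(genCondTransactions(transaction, byHash))
    println(genCondTransactions(transaction, byRange))
  }
}
```

    With `byHash`, partition 1 receives `List(0, 1, 2, 3, 4, 6, 7)`, nearly a 
full copy of the transaction. With `byRange`, partition 0 receives only 
`List(0, 1, 2, 3, 4)`, which matches the lists shown above.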


