Github user feynmanliang commented on the pull request:

    https://github.com/apache/spark/pull/7412#issuecomment-121762995
  
    My concern is not that the projected database is larger than the number of 
patterns; rather, it is that the `groupByKey` on L103 will overload an executor 
if some key (prefix) has many values (suffixes) associated with it.
    
    For example, if you had `minPatternsBeforeShuffle` frequent length-one 
items and a single length-two item (call it *AB*) that appeared in every 
transaction, then L93-99 would terminate and L103 would try to send the suffix 
of every sequence containing *AB* to the executor assigned to *AB*. Since 
*AB* appears in every transaction, the entire dataset would be sent to a single 
executor.
    
    To get around this, you can use `aggregateByKey` to count the number of 
suffixes associated with each prefix (this takes a single pass over 
`prefixSuffixPairs`) and begin local processing only when the largest 
per-prefix count is below some threshold the executor is capable of handling.
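    As a minimal sketch of that check (plain collections rather than RDDs; 
`maxSuffixesPerExecutor` and the sample data are made up for illustration, 
and on an RDD the count would come from a single `aggregateByKey` pass):

    ```scala
    object SuffixCounts {
      // Count how many suffixes are paired with each prefix.
      // On an RDD this would be, e.g.:
      //   prefixSuffixPairs.aggregateByKey(0L)((count, _) => count + 1L, _ + _)
      def countSuffixes[P, S](pairs: Seq[(P, S)]): Map[P, Long] =
        pairs.groupBy { case (prefix, _) => prefix }
             .map { case (prefix, group) => prefix -> group.size.toLong }

      // Local processing is safe only if no single prefix's suffix set
      // exceeds what one executor can hold (hypothetical threshold).
      def safeToProcessLocally[P](counts: Map[P, Long],
                                  maxSuffixesPerExecutor: Long): Boolean =
        counts.isEmpty || counts.values.max <= maxSuffixesPerExecutor
    }

    // Example: "AB" is a skewed prefix carrying three suffixes.
    val counts = SuffixCounts.countSuffixes(
      Seq(("A", "B"), ("A", "C"), ("AB", "C"), ("AB", "D"), ("AB", "E")))
    // counts == Map("A" -> 2L, "AB" -> 3L)
    // SuffixCounts.safeToProcessLocally(counts, 2L) == false
    ```

    The same counts also tell you *which* prefixes are skewed, so those could 
be kept distributed while the small groups are processed locally.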
    



