Github user feynmanliang commented on the pull request:

    https://github.com/apache/spark/pull/7412#issuecomment-122104426
  
    The example you've given doesn't rule out my concern: if any one of those 5 
prefixes assigned to an executor has too many suffixes, then that executor will 
be overloaded.
    
    For example, assume the number of length-1 frequent items is >= 
`minPatternsBeforeShuffle`, and that one of those frequent items (call it item 
A) appears in every transaction. After the `groupByKey`, the executor assigned 
to item A will receive a suffix from every transaction in the dataset. Since we 
don't assume the dataset fits on a single machine, this executor will be 
overloaded.
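    To make the failure mode concrete, here is a minimal sketch of that skew 
(illustrative only, not code from this PR; the key `"A"`, the counts, and the 
`local[2]` setup are made-up assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeySkewSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, just to make the sketch runnable.
    val sc = new SparkContext(
      new SparkConf().setAppName("skew-sketch").setMaster("local[2]"))

    // (prefix, suffix) pairs: prefix "A" occurs in every one of n
    // transactions, while the other prefixes are rare.
    val n = 1000000
    val pairs = sc.parallelize(1 to n).flatMap { i =>
      Seq(("A", s"suffix-$i")) ++
        (if (i % 1000 == 0) Seq((s"B$i", s"suffix-$i")) else Nil)
    }

    // groupByKey materializes all n suffixes for "A" in a single
    // partition on a single executor -- adding machines does not help.
    val grouped = pairs.groupByKey()
    grouped.mapValues(_.size)
      .collect()
      .sortBy(-_._2)
      .take(3)
      .foreach(println)

    sc.stop()
  }
}
```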
    
    I don't really understand the diagram you've drawn. As the prefix length 
increases, the number of suffixes associated with that prefix can never 
increase (Lemma 3.2, "Projected databases keep shrinking", in the PrefixSpan 
paper), so the graph should be monotonically non-increasing.
    
    `numPatterns >= minPatternsBeforeLocalProcessing` is a heuristic: if there 
are more patterns, then we expect each pattern to be longer, and since pattern 
length (== prefix length) is larger, by Lemma 3.2 we expect fewer suffixes 
associated with each prefix. However, this is only a heuristic, whereas 
checking `maxNumSuffixesForAnyPrefix <= threshold` would guarantee that the 
algorithm terminates without crashing.
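    For reference, a hedged sketch of what such a check could look like (this 
is not this PR's implementation; `maxSuffixesPerPrefix`, the RDD element 
types, and the helper name are assumptions):

```scala
import org.apache.spark.rdd.RDD

// Sketch: decide whether it is safe to group suffixes by prefix for local
// processing, by bounding the size of the largest projected database.
def safeToProcessLocally(
    prefixSuffixes: RDD[(List[Int], List[Int])],
    maxSuffixesPerPrefix: Long): Boolean = {
  // Count suffixes per prefix with reduceByKey, which aggregates map-side
  // and never materializes a whole group on one executor.
  val largestProjectedDb = prefixSuffixes
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)
    .values
    .max()
  largestProjectedDb <= maxSuffixesPerPrefix
}
```

    Because the check itself only shuffles (prefix, count) pairs, it is cheap 
relative to `groupByKey`, and it upper-bounds what any single executor would 
have to hold.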

