Github user feynmanliang commented on the pull request:
https://github.com/apache/spark/pull/7412#issuecomment-122104426
The example you've given doesn't rule out my concern: if any one of those 5
prefixes assigned to an executor has too many suffixes, then that executor
will be overloaded.
For example, assume the number of frequent length-1 items is >=
minPatternsBeforeShuffle, and that one of those frequent items (call it item A)
appears in every transaction. After the `groupByKey`, the executor assigned to
item A will receive a suffix from every transaction in the dataset. Since we
don't assume the dataset fits on a single machine, this executor will be
overloaded.
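
A minimal, self-contained sketch of that scenario (names like
`prefixSuffixPairs` are illustrative, not this PR's actual code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SkewSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("skew-sketch").setMaster("local[2]"))

    // Toy database: item 1 plays the role of "item A" and appears in
    // every transaction.
    val transactions = sc.parallelize(Seq(
      Array(1, 2, 3),
      Array(1, 3, 4),
      Array(1, 2, 5)))

    // Emit one (length-1 prefix, suffix) pair per item occurrence.
    val prefixSuffixPairs = transactions.flatMap { t =>
      t.zipWithIndex.map { case (item, i) => (Seq(item), t.drop(i + 1).toSeq) }
    }

    // After groupByKey, all suffixes keyed by Seq(1) land in a single
    // group -- i.e., on one executor -- and that group has one entry per
    // transaction in the dataset.
    prefixSuffixPairs.groupByKey().collect().foreach {
      case (prefix, suffixes) =>
        println(s"prefix=$prefix suffixCount=${suffixes.size}")
    }
    sc.stop()
  }
}
```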
I don't really understand the diagram you've drawn. As the prefix length
increases, the number of suffixes associated with that prefix never increases
(Lemma 3.2, "Projected databases keep shrinking," in the PrefixSpan paper), so
the graph should be monotonically non-increasing.
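
To make the monotonicity concrete, here is a toy projection in plain Scala
(the `project` helper is hypothetical, written just for this example):

```scala
// Projecting by one more item keeps, for each sequence that contains the
// item, only the suffix after its first occurrence, so the projected
// database can never grow (Lemma 3.2).
def project(db: Seq[Seq[Int]], item: Int): Seq[Seq[Int]] =
  db.flatMap { seq =>
    val i = seq.indexOf(item)
    if (i >= 0) Some(seq.drop(i + 1)) else None
  }

val db = Seq(Seq(1, 2, 3), Seq(1, 3), Seq(2, 3))
val byPrefix1  = project(db, 1)         // <1>-projected:   Seq(2,3), Seq(3)
val byPrefix12 = project(byPrefix1, 2)  // <1,2>-projected: Seq(3)
// Sizes shrink 3 -> 2 -> 1 as the prefix grows.
```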
`numPatterns >= minPatternsBeforeLocalProcessing` is a heuristic: if there
are more patterns, then we expect each pattern to be longer, and since the
pattern length (== prefix length) is longer, by Lemma 3.2 we expect fewer
suffixes associated with each prefix. However, this is only a heuristic,
whereas checking `maxNumSuffixesForAnyPrefix <= threshold` will guarantee
that the algorithm terminates without crashing.
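
A sketch of the guaranteed check I have in mind (again, hypothetical names,
not code from this PR): count the suffixes per prefix up front, so the
largest group's size is known before any `groupByKey` happens:

```scala
import org.apache.spark.rdd.RDD

// Returns true iff no single prefix's suffix group would exceed the
// assumed `threshold`, i.e. local processing is known to be safe.
def safeToProcessLocally(
    prefixSuffixPairs: RDD[(Seq[Int], Seq[Int])],
    threshold: Long): Boolean = {
  // countByKey brings back only a small prefix -> count map, so this
  // check is cheap relative to materializing the grouped suffixes.
  val maxNumSuffixesForAnyPrefix =
    prefixSuffixPairs.countByKey().values.foldLeft(0L)(math.max)
  maxNumSuffixesForAnyPrefix <= threshold
}
```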