Github user feynmanliang commented on the pull request:
https://github.com/apache/spark/pull/7412#issuecomment-121762995
My concern is not that the projected database is larger than the number of
patterns; rather, it is that the `groupByKey` on L103 will overload an executor
if some key (prefix) has many values (suffixes) associated with it.
For example, if you had `minPatternsBeforeShuffle` frequent length-one
items and a single length-two item (call it *AB*) that appeared in every
transaction, then L93-99 would terminate and L103 would try to send the suffix
of every sequence containing *AB* to the executor assigned to *AB*. Since
*AB* appears in every transaction, the entire dataset would be sent to a single
executor.
To get around this, you could use `aggregateByKey` to count the number of
suffixes associated with each prefix (this takes a single pass over
`prefixSuffixPairs`) and only begin local processing when the maximum per-key
count is below some threshold a single executor is capable of handling.
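A minimal sketch of the counting pass I have in mind (not code from this PR;
`prefixSuffixPairs` is the RDD from L103 and `maxSuffixesPerExecutor` is a
hypothetical threshold):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// One pass over prefixSuffixPairs: count the suffixes per prefix, then take
// the maximum count to decide whether any single key would overload the
// executor it hashes to.
def safeToGroup[P: ClassTag, S: ClassTag](
    prefixSuffixPairs: RDD[(P, S)],
    maxSuffixesPerExecutor: Long): Boolean = {
  val maxSuffixCount = prefixSuffixPairs
    .aggregateByKey(0L)(
      (count, _) => count + 1L, // seqOp: count suffixes within a partition
      _ + _)                    // combOp: merge per-partition counts
    .values
    .max()
  maxSuffixCount <= maxSuffixesPerExecutor
}
```

If the check fails, the skewed prefixes would need further projection before
any `groupByKey`; only the remaining keys are safe to group and process
locally.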