Cyril de Vogelaere created SPARK-20324:
------------------------------------------
Summary: Control itemSets length in PrefixSpan
Key: SPARK-20324
URL: https://issues.apache.org/jira/browse/SPARK-20324
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 2.1.0
Reporter: Cyril de Vogelaere
Priority: Minor
The idea behind this improvement would be to allow better control over the size
of itemSets in solution patterns.
For example, suppose you possess a huge dataset of series of products bought
together, one sequence per client, and you want to find items frequently bought
in pairs, so as to offer interesting promotions to your clients or boost certain
sales.
In the current implementation, all solutions have to be calculated before the
user can sort through them and select only the interesting ones.
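The post-processing step described above can be sketched as follows. This is
only an illustration in plain Python, not Spark code: the function name
filter_by_itemset_size and the (sequence, freq) tuple layout are assumptions
made for the example, with each sequence represented as a list of itemsets.

```python
def filter_by_itemset_size(patterns, min_items=1, max_items=None):
    """Keep only patterns whose itemsets all have a size in [min_items, max_items].

    `patterns` is a list of (sequence, freq) pairs that has already been
    fully computed; each sequence is a list of itemsets (lists of items).
    """
    kept = []
    for sequence, freq in patterns:
        sizes = [len(itemset) for itemset in sequence]
        if any(size < min_items for size in sizes):
            continue  # at least one itemset is too small
        if max_items is not None and any(size > max_items for size in sizes):
            continue  # at least one itemset is too large
        kept.append((sequence, freq))
    return kept


# Hypothetical, hand-written result set standing in for the mined patterns.
patterns = [
    ([["a"]], 3),
    ([["a", "b"]], 2),
    ([["a", "b"], ["c"]], 2),
]

# Keep only patterns made exclusively of pairs.
pairs_only = filter_by_itemset_size(patterns, min_items=2, max_items=2)
```

Note that every pattern still has to be mined and collected before this
filter can run, which is the cost the proposal aims to avoid.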
What I'm proposing here is the addition of two parameters:
First, a maxItemPerItemset parameter which would limit the maximum number of
items per itemset to a certain size X. This could allow an important reduction
in the search space, speeding up the process of finding these specific solutions.
Second, a companion minItemPerItemset parameter that would limit the minimum
number of items per itemset, discarding solutions that do not fit this
constraint. Although this wouldn't entail a reduction of the search space, it
should still allow interested users to reduce the number of solutions collected
by the driver.
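To make the difference between the two parameters concrete, here is a minimal
sketch in plain Python. It is not Spark's PrefixSpan implementation: it uses a
naive generate-and-count pattern growth instead of projected databases, and
the names mine, max_items and min_items are hypothetical stand-ins for the
proposed maxItemPerItemset and minItemPerItemset. It also counts how many
candidates are evaluated, as a rough proxy for the size of the search space.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (greedy left-to-right match)."""
    pos = 0
    for itemset in pattern:
        needed = set(itemset)
        while pos < len(sequence) and not needed <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match a strictly later itemset
    return True


def mine(db, min_support, max_items=None, min_items=1):
    """Return ({pattern: support}, number of candidates evaluated)."""
    items = sorted({i for seq in db for itemset in seq for i in itemset})
    results = {}
    evaluated = 0

    def consider(candidate):
        nonlocal evaluated
        evaluated += 1
        support = sum(contains(seq, candidate) for seq in db)
        if support < min_support:
            return  # infrequent: no extension of it can be frequent
        # min_items only filters what is reported; growth continues regardless.
        if all(len(itemset) >= min_items for itemset in candidate):
            results[candidate] = support
        grow(candidate)

    def grow(pattern):
        for item in items:
            # Sequence extension: start a new itemset with `item`.
            consider(pattern + ((item,),))
            if not pattern:
                continue
            last = pattern[-1]
            # Itemset extension: skipped once the last itemset is full,
            # which is where max_items prunes the search space.
            has_room = max_items is None or len(last) < max_items
            if has_room and item > last[-1]:
                consider(pattern[:-1] + (last + (item,),))

    grow(())
    return results, evaluated


# Toy database: one sequence of itemsets per client.
db = [
    [{"a", "b"}, {"c"}],
    [{"a", "b"}, {"c"}],
    [{"a"}, {"c"}],
]

all_patterns, n_all = mine(db, min_support=2)
capped, n_capped = mine(db, min_support=2, max_items=1)
pairs, n_pairs = mine(db, min_support=2, min_items=2)
```

On this toy input, capping the itemset size lowers the number of candidates
evaluated, while the minimum only shrinks the set of reported patterns and
leaves the number of evaluated candidates unchanged.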
If this proposal seems interesting to the community, I will implement a
solution along with tests to guarantee the correctness of its implementation.