Cyril de Vogelaere created SPARK-20324:
------------------------------------------

             Summary: Control itemSets length in PrefixSpan
                 Key: SPARK-20324
                 URL: https://issues.apache.org/jira/browse/SPARK-20324
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.1.0
            Reporter: Cyril de Vogelaere
            Priority: Minor


The idea behind this improvement would be to allow better control over the size 
of itemSets in solution patterns.

For example, assume you possess a huge dataset of products bought together, 
with one sequence of purchases per client, and you want to find items that are 
frequently bought in pairs, so as to offer interesting promotions to your 
clients or boost certain sales.

In the current implementation, all solutions have to be computed before the 
user can sort through them and select only the interesting ones.
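
As a rough illustration of that workflow, the sketch below filters an 
already-computed result set after the fact. It is plain Python rather than 
Spark code; the `patterns` data and the `has_only_pairs` helper are 
hypothetical and only stand in for the output of a PrefixSpan run.

```python
# Hypothetical mined patterns: each pattern is a sequence of itemsets,
# paired with its frequency. This stands in for a full PrefixSpan result.
patterns = [
    ([["bread"]], 5),
    ([["bread", "butter"]], 4),
    ([["bread", "butter"], ["jam", "tea"]], 3),
    ([["bread", "butter", "jam"]], 2),
    ([["milk"], ["bread", "butter"]], 2),
]


def has_only_pairs(sequence):
    """Return True if every itemset in the sequence holds exactly two items."""
    return all(len(itemset) == 2 for itemset in sequence)


# Every pattern above had to be computed before this selection could run.
pairs_only = [(seq, freq) for seq, freq in patterns if has_only_pairs(seq)]
```

Only two of the five patterns survive the filter, yet all five were mined and 
collected first.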

What I'm proposing here is the addition of two parameters:

First, a maxItemPerItemset parameter, which would limit the number of items 
per itemset to at most X. This could substantially reduce the search space, 
speeding up the search for these specific solutions.

Second, a companion minItemPerItemset parameter, which would set the minimum 
number of items per itemset and discard solutions that do not fit this 
constraint. Although this wouldn't entail a reduction of the search space, it 
should still allow interested users to reduce the number of solutions 
collected by the driver.
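
To make the difference between the two parameters concrete, here is a minimal 
sketch in plain Python. It is not Spark's PrefixSpan code, and how the 
parameters would actually be wired into the implementation is an assumption; 
all function names are hypothetical. The first function counts how many 
non-empty itemsets can be built from a given number of distinct items, which 
gives an upper bound on the candidates at a single position (ignoring any 
pruning by support). The other two show why an upper bound can stop the search 
early while a lower bound can only filter the results.

```python
from math import comb  # requires Python 3.8+


def count_candidate_itemsets(n_items, max_items=None):
    """Count the non-empty itemsets that can be formed from n_items distinct
    items, optionally capped at max_items items per itemset."""
    upper = n_items if max_items is None else min(max_items, n_items)
    return sum(comb(n_items, k) for k in range(1, upper + 1))


def may_grow_itemset(itemset, max_items):
    """Upper bound: an itemset that already holds max_items items cannot be
    extended, so the search can stop growing it (this prunes the search)."""
    return len(itemset) < max_items


def satisfies_min(sequence, min_items):
    """Lower bound: a pattern is reported only if every itemset holds at least
    min_items items. Short itemsets must still be explored, because extending
    them may produce valid ones, so this only filters the output."""
    return all(len(itemset) >= min_items for itemset in sequence)


unbounded = count_candidate_itemsets(20)           # 1048575
capped = count_candidate_itemsets(20, max_items=2)  # 210

# Hypothetical patterns found by a search; only some satisfy the lower bound.
found = [
    [["a"]],
    [["a", "b"]],
    [["a", "b"], ["c"]],
    [["a", "b"], ["c", "d"]],
]
reported = [seq for seq in found if satisfies_min(seq, min_items=2)]
```

With 20 distinct items, capping itemsets at two items leaves 210 possible 
itemsets instead of 1048575. The actual gain on a real dataset depends on the 
data and the support threshold.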

If this proposal seems interesting to the community, I will implement it, 
along with tests to guarantee the correctness of the implementation.
