zhangyouhua created SPARK-6381:
----------------------------------
Summary: add Apriori algorithm to MLLib
Key: SPARK-6381
URL: https://issues.apache.org/jira/browse/SPARK-6381
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.3.1
Reporter: zhangyouhua
Fix For: 1.4.0
[~mengxr]
There are many algorithms for association rule mining, for example FPGrowth
and Apriori. These are classic machine learning algorithms and are very
useful in big data mining. The FPGrowth algorithm implemented in Spark 1.3
can handle very large data sets, but it needs to build an FPTree before
mining frequent items. So when the transaction data is smaller, the data is
sparse, and minSupport is larger, we can choose the Apriori algorithm instead.
How can the Apriori algorithm be parallelized?
1. Generate frequent items by filtering the input data using the minimal
support level.
private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]],
minCount: Long, partitioner: Partitioner): Array[Item]
2. Generate frequent itemsets by running Apriori; the extraction is done on
each partition.
2.1 Create candidateSet from kFreqItems and k.
private def createCandidateSet[Item: ClassTag](kFreqItems:
Array[(Array[Item], Long)], k: Int)
2.2 Create kFreqItems by scanning dataSet and counting the itemsets in
candidateSet.
private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]],
candidateSet: Array[Array[Item]], minCount: Double):
RDD[(Array[Item], Long)]
2.3 Filter dataSet by candidateSet.
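
To make the steps above concrete, here is a minimal local sketch of the loop.
It is not the proposed MLlib implementation: it assumes a plain Seq in place
of an RDD, Set[String] in place of Array[Item], and it drops the partitioner.
The filterDataSet and run helpers, the meaning of k (size of the candidates
being generated), and the reading of step 2.3 (drop transactions that contain
no candidate) are assumptions made for illustration.

```scala
object AprioriSketch {
  type Itemset = Set[String]

  // Step 1: keep single items that occur in at least minCount transactions.
  def genFreqItems(data: Seq[Itemset], minCount: Long): Seq[(Itemset, Long)] =
    data.flatten
      .groupBy(identity)
      .map { case (item, occurrences) => (Set(item), occurrences.size.toLong) }
      .filter(_._2 >= minCount)
      .toSeq

  // Step 2.1: join frequent (k-1)-itemsets into k-item candidates and prune
  // any candidate that has an infrequent (k-1)-subset.
  def createCandidateSet(kFreqItems: Seq[(Itemset, Long)], k: Int): Set[Itemset] = {
    val freq = kFreqItems.map(_._1).toSet
    for {
      a <- freq
      b <- freq
      c = a ++ b
      if c.size == k && c.subsets(k - 1).forall(freq.contains)
    } yield c
  }

  // Step 2.2: count each candidate over the data and keep the frequent ones.
  def scanDataSet(data: Seq[Itemset], candidateSet: Set[Itemset],
                  minCount: Long): Seq[(Itemset, Long)] =
    candidateSet.toSeq
      .map(c => (c, data.count(t => c.subsetOf(t)).toLong))
      .filter(_._2 >= minCount)

  // Step 2.3 (assumed meaning): drop transactions that contain no candidate,
  // since they cannot support any larger itemset either.
  def filterDataSet(data: Seq[Itemset], candidateSet: Set[Itemset]): Seq[Itemset] =
    data.filter(t => candidateSet.exists(_.subsetOf(t)))

  // Hypothetical driver: repeat steps 2.1 to 2.3 until no frequent itemsets remain.
  def run(data: Seq[Itemset], minSupport: Double): Map[Itemset, Long] = {
    val minCount = math.ceil(minSupport * data.size).toLong
    var current = genFreqItems(data, minCount)
    var remaining = data
    var result = current.toMap
    var k = 2
    while (current.nonEmpty) {
      val candidates = createCandidateSet(current, k)
      remaining = filterDataSet(remaining, candidates)
      current = scanDataSet(remaining, candidates, minCount)
      result ++= current
      k += 1
    }
    result
  }
}
```

In a distributed version, the counting in scanDataSet is the part that would
run per partition, with candidateSet shipped to the workers; the candidate
generation stays on the driver because it only touches the frequent itemsets.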
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)