yuhao yang created SPARK-20114:
----------------------------------
Summary: spark.ml parity for sequential pattern mining - PrefixSpan
Key: SPARK-20114
URL: https://issues.apache.org/jira/browse/SPARK-20114
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
This JIRA tracks feature parity for PrefixSpan and sequential pattern mining
in spark.ml with the DataFrame API.
A few design issues are listed below for discussion first; subtasks for Scala,
Python and R will be created afterwards.
# Wrapping the MLlib PrefixSpan and providing a generic fit() should be
straightforward. However, PrefixSpan only extracts frequent sequential patterns,
which are not well suited to direct prediction on new records. Please
read
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
for some background. Thanks to Philippe Fournier-Viger for providing
insights. If we want to keep using the Estimator/Transformer pattern, the options
are:
#* Implement a dummy transform for PrefixSpanModel, which will not add a
new column to the input Dataset.
#* Add a feature to extract sequential rules from the sequential
patterns, then use those rules in transform, as FPGrowthModel does. The
extracted rules have the form X -> Y, where X and Y are sequential patterns.
In practice, however, these rules are not very good: they are too precise and
thus not noise tolerant.
# Unlike the case of association rules and frequent itemsets, sequential rules
can be extracted directly from the original dataset more efficiently using
algorithms such as RuleGrowth and ERMiner. These rules have the form X -> Y,
where X and Y are each unordered but X must appear before Y. This form is more
general and can work better in practice for prediction.
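
To make the difference between the two rule forms concrete, here is a minimal sketch in plain Python (not Spark code, and not how RuleGrowth or ERMiner work internally). The helper names contains_pattern, contains_unordered_rule and support, as well as the toy database, are hypothetical and exist only for illustration. Sequences are lists of itemsets, as in PrefixSpan.

```python
from typing import Callable, List, Set

Itemset = Set[str]
Sequence = List[Itemset]


def contains_pattern(seq: Sequence, pattern: Sequence) -> bool:
    """True if `pattern` (an ordered list of itemsets) is a subsequence of `seq`."""
    pos = 0
    for itemset in pattern:
        # Advance to the next position whose itemset contains `itemset`.
        while pos < len(seq) and not itemset <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def contains_unordered_rule(seq: Sequence, x: Itemset, y: Itemset) -> bool:
    """True if all items of `x` occur, in any order, before all items of `y`."""
    seen: Set[str] = set()
    for i, itemset in enumerate(seq):
        seen |= itemset & x
        if seen == x:
            # Earliest point where X is complete; Y must lie strictly after it.
            later = set().union(*seq[i + 1:])
            return y <= later
    return False


def support(db: List[Sequence], matches: Callable[[Sequence], bool]) -> float:
    """Fraction of sequences in `db` for which `matches` returns True."""
    return sum(1 for s in db if matches(s)) / len(db)


# Toy database: the second sequence swaps the order of "a" and "b".
db = [
    [{"a"}, {"b"}, {"c"}],
    [{"b"}, {"a"}, {"c"}],
    [{"a"}, {"b"}, {"d"}, {"c"}],
    [{"b"}, {"a"}, {"d"}],
]

# Form 1: X and Y are sequential patterns, so <a, b> -> <c> needs a, b, c in order.
pattern_support = support(db, lambda s: contains_pattern(s, [{"a"}, {"b"}, {"c"}]))

# Form 2: X and Y are unordered, so {a, b} -> {c} only needs a and b before c.
unordered_support = support(
    db, lambda s: contains_unordered_rule(s, {"a", "b"}, {"c"})
)

print(pattern_support)    # 0.5
print(unordered_support)  # 0.75
```

On this toy data the pattern-based rule misses the second sequence only because "a" and "b" are swapped, while the unordered rule still matches it. That is the noise-tolerance argument above in miniature.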
I'd like to hear from users which kind of sequential rules is more practical.