Cyril de Vogelaere created SPARK-20179:
------------------------------------------
Summary: Major improvements to Spark's PrefixSpan
Key: SPARK-20179
URL: https://issues.apache.org/jira/browse/SPARK-20179
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 2.1.0
Environment: All
Reporter: Cyril de Vogelaere
The code I would like to push enables major performance improvements through the
use of a CP-based solver (namely OscaR, an open-source constraint-programming
solver). As you can see in the log-scale graph reachable from the link below,
performance is at worst roughly the same, and at best up to 50x faster (FIFA
dataset).
Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link for datasets :
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the
two slen datasets used are the first two in the list of 9)
Performance was tested on the CECI servers, providing the driver with 10G of
memory (more than needed) and running 4 slaves.
In addition to the performance improvements, I also added several new
functionalities:
- Unlimited max pattern length (with input 0)
- Min pattern length : any pattern below that length won't be output
- Max items per itemset : an itemset won't be grown beyond the input
number, thus reducing the search space.
- Head start : during the initial dataset cleaning, the frequent items were
found and then discarded, which resulted in an inefficient first iteration of
the genFreqPattern method. The algorithm now uses them if they are provided,
and uses the empty pattern in case they're not. A slight performance
improvement was observed.
- Sub-problem limit : when the resulting item sequences can be very long and
the user has a small number of very powerful machines, this parameter allows a
quick switch to local execution, tremendously improving performance. Outside of
those conditions, performance may be negatively affected.
- Item constraints : allow the user to specify constraints on the occurrences
of an item (=, >, <, >=, <=, !=). For users looking for specific results, this
can greatly reduce the search space, which also improves performance.
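To make the pattern-length parameters above concrete, here is a toy
illustration in Python (not the actual Scala/MLlib code from the linked
implementation; the function name and signature are invented for this sketch).
It shows how a minimal PrefixSpan over single-item sequences can honor a min
pattern length and the "0 means unlimited" convention for max pattern length
while pruning the search space:

```python
from collections import defaultdict

def prefix_span(sequences, min_support, min_len=1, max_len=0):
    """Toy PrefixSpan over sequences of single items.

    min_len: patterns shorter than this are counted but not emitted.
    max_len: 0 means unlimited, mirroring the proposed 'input 0' convention;
             otherwise growth stops once a pattern reaches this length.
    """
    results = []

    def grow(prefix, projected):
        # Count, for each item, how many projected sequences contain it.
        counts = defaultdict(int)
        for seq in projected:
            for item in set(seq):
                counts[item] += 1
        for item, count in counts.items():
            if count < min_support:
                continue
            pattern = prefix + [item]
            if len(pattern) >= min_len:          # min pattern length filter
                results.append((pattern, count))
            if max_len == 0 or len(pattern) < max_len:  # max length pruning
                # Project the database on the suffixes after the first
                # occurrence of 'item', then recurse.
                new_proj = []
                for seq in projected:
                    if item in seq:
                        suffix = seq[seq.index(item) + 1:]
                        if suffix:
                            new_proj.append(suffix)
                grow(pattern, new_proj)

    grow([], sequences)
    return results
```

For example, with sequences [[1, 2, 3], [1, 2], [2, 3]] and min_support=2,
this yields the patterns [1], [2], [3], [1, 2], and [2, 3]; setting min_len=2
suppresses the singletons, and setting max_len=1 stops growth at singletons.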
Please note that the aforementioned functionalities did not come into play when
testing performance. The performance shown on the graph is 'merely' the result
of replacing the local execution with a CP-based algorithm. (maxLocalProjDBSize
was also kept at its default value (32000000L), to make sure the performance
couldn't be artificially improved.)
Additional notes :
- The performance improvements are mostly seen for datasets where all itemsets
are of size one, since that is a necessary condition for the use of the
CP-based algorithm. But as you can see in the two Slen datasets, performance
was also slightly improved for datasets with multiple items per itemset. The
algorithm was built to detect when CP can be used, and to use it given the
opportunity.
- The performance displayed here is the result of six months of work. Various
other approaches were tested to improve performance, without as much success.
I can thus say with some confidence that the performance attained here will be
very hard to improve further.
- If you want me to test performance on a specific dataset or to provide
additional information, it would be my pleasure to do so :). Just hit me up at
the email below :
Email : [email protected]
- I am a new contributor to Spark, and am not familiar with the whole
procedure. In case I did something incorrectly, I will fix it as soon as
possible.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)