[ https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cyril de Vogelaere updated SPARK-20179: --------------------------------------- Priority: Major (was: Minor) > Major improvements to Spark's Prefix span > ----------------------------------------- > > Key: SPARK-20179 > URL: https://issues.apache.org/jira/browse/SPARK-20179 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.1.0 > Environment: All > Reporter: Cyril de Vogelaere > Original Estimate: 0h > Remaining Estimate: 0h > > The code I would like to push allows major performances improvement for > single-item patterns through the use of a CP based solver (Namely OscaR, an > open-source solver). And slight perfomance improved for multi-item patterns. > As you can see in the log scale graph reachable from the link below, the > performance are at worse roughly the same but at best, up to 50x faster (FIFA > dataset) > Link to graph : http://i67.tinypic.com/t06lw7.jpg > Link to implementation : https://github.com/Syrux/spark > Link for datasets : > http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the > two slen datasets used are the first two in the list of 9) > Performances were tested on the CECI servers, providing the driver with 10G > memory (more than needed) and running 4 slaves. > Additionnally to the performance improvements. I also added a bunch of new > fonctionnalities : > - Unlimited Max pattern length (with input 0) : > - Min pattern length : Any item below that length won't be outputted > - Max Item per itemset : An itemset won't be grown further than the input > number, thus reducing the search space. > - Head start : During the initial dataset cleaning, the frequent item were > found then discarded. Which resulted in a inefficient first iteration of the > genFreqPattern method. The algorithm new uses them if they are provided, and > uses the empty pattern in case they're not. Slight improvement of > performances were found. > - Sub-problem limit : When resulting item sequence can be very long and the > user disposes of a small number of very powerfull machine, this parameter > will allow a quick switch to local execution. Tremendously improving > performances. Outside of those conditions, the performance may be the > negatively affected. > - Item constraints : Allow the user to specify constraint on the occurences > of an item (=, >, <, >=, <=, !=). For user who are looking for some specific > results, the search space can be greatly reduced. Which also improve > performances. > Please take note that the afformentionned fonctionnalities didn't come into > play when testing the performance. The performance shown on the graph are > 'merely' the result of the remplacement of the local execution by a CP based > algorithm. (maxLocalProjDBSize was also kept to it's default value > (32000000L), to make sure the performance couldn't be artificially improved) > Trade offs : > - The algorithm does have a slight trade off. Since the algorithm needs to > detect whether sequences can use the specialised single-item patterns CP > algorithm. It may be a bit slower when it is not needed. This trade of was > mitigated by introducing a round sequence cleaning before the local > execution. Thus improving the performance of multi-item local executions when > the cleaning is effective. In case the no item can be cleaned, the check will > appear in the performence, creating a slight drop in them. > - All other change provided shouldn't have any effect on efficiency or > complexity if left to their default value. (Where they are basically > desactivated). When activated, they may however reduce the search space and > thus improve performance. > Additionnal note : > - The performance improvement are mostly seen far datasets where all itemset > are of size one, since it is a necessary condition for the use of the CP > based algorithm. But as you can see in the two Slen datasets, performances > were also slightly improved for algorithms which have multiple items per > itemset. The algorithm was built to detect when CP can be used, and to use > it, given the opportunity. > - The performance displayed here are the results of six months of work. > Various other things were tested to improve performance, without as much > success. I can thus say with a bit of confidence that the performances here > attained will be very hard to improve further. > - In case you want me to test the performance on a specific dataset or to > provide additionnal informations, it would be my pleasure to do so :). Just > it me up with my email below : > Email : cyril.devogela...@gmail.com > - I am a newbie contributor to Spark, and am not familliar with the whole > procedure at all. In case, I did something incorrectly, I will fix it as soon > as possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org