[
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-20179.
-------------------------------
Resolution: Duplicate
> Major improvements to Spark's Prefix span
> -----------------------------------------
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.1.0
> Environment: All
> Reporter: Cyril de Vogelaere
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> The code I would like to push allows major performances improvement for
> single-item patterns through the use of a CP based solver (Namely OscaR, an
> open-source solver). And slight perfomance improved for multi-item patterns.
> As you can see in the log scale graph reachable from the link below, the
> performance are at worse roughly the same but at best, up to 50x faster (FIFA
> dataset)
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets :
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the
> two slen datasets used are the first two in the list of 9)
> Performances were tested on the CECI servers, providing the driver with 10G
> memory (more than needed) and running 4 slaves.
> Additionnally to the performance improvements. I also added a bunch of new
> fonctionnalities :
> - Unlimited Max pattern length (with input 0) : No improvement in
> perfomance here, simply allows the use of unlimited max pattern length.
> - Min pattern length : Any item below that length won't be outputted. No
> improvement in performance, just a new functionnality.
> - Max Item per itemset : An itemset won't be grown further than the inputed
> number, thus reducing the search space.
> - Head start : During the initial dataset cleaning, the frequent item were
> found then discarded. Which resulted in a inefficient first iteration of the
> genFreqPattern method. The algorithm new uses them if they are provided, and
> uses the empty pattern in case they're not. Slight improvement of
> performances were found.
> - Sub-problem limit : When resulting item sequence can be very long and the
> user disposes of a small number of very powerfull machine, this parameter
> will allow a quick switch to local execution. Tremendously improving
> performances. Outside of those conditions, the performance may be the
> negatively affected.
> - Item constraints : Allow the user to specify constraint on the occurences
> of an item (=, >, <, >=, <=, !=). For user who are looking for some specific
> results, the search space can be greatly reduced. Which also improve
> performances.
> Of course, all of theses added feature where tested for correctness. As you
> can see on the github link.
> Please take note that the afformentionned fonctionnalities didn't come into
> play when testing the performance. The performance shown on the graph are
> 'merely' the result of the remplacement of the local execution by a CP based
> algorithm. (maxLocalProjDBSize was also kept to it's default value
> (32000000L), to make sure the performance couldn't be artificially improved)
> Trade offs :
> - The algorithm does have a slight trade off. Since the algorithm needs to
> detect whether sequences can use the specialised single-item patterns CP
> algorithm. It may be a bit slower when it is not needed. This trade of was
> mitigated by introducing a round sequence cleaning before the local
> execution. Thus improving the performance of multi-item local executions when
> the cleaning is effective. In case the no item can be cleaned, the check will
> appear in the performence, creating a slight drop in them.
> - All other change provided shouldn't have any effect on efficiency or
> complexity if left to their default value. (Where they are basically
> desactivated). When activated, they may however reduce the search space and
> thus improve performance.
> Additionnal note :
> - The performance improvement are mostly seen far datasets where all itemset
> are of size one, since it is a necessary condition for the use of the CP
> based algorithm. But as you can see in the two Slen datasets, performances
> were also slightly improved for algorithms which have multiple items per
> itemset. The algorithm was built to detect when CP can be used, and to use
> it, given the opportunity.
> - The performance displayed here are the results of six months of work.
> Various other things were tested to improve performance, without as much
> success. I can thus say with a bit of confidence that the performances here
> attained will be very hard to improve further.
> - In case you want me to test the performance on a specific dataset or to
> provide additionnal informations, it would be my pleasure to do so :). Just
> it me up with my email below :
> Email : [email protected]
> - I am a newbie contributor to Spark, and am not familliar with the whole
> procedure at all. In case, I did something incorrectly, I will fix it as soon
> as possible.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]