[
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951077#comment-15951077
]
Cyril de Vogelaere commented on SPARK-20179:
--------------------------------------------
Hello Sean,
I did have a look at the contributing page, but
I don't really understand what you mean exactly. Do you mean I should go over
every bit of code I changed?
Because that may take a while ^^'
As for the performance trade-offs, I will add them to the ticket, starting right now.
> Major improvements to Spark's Prefix span
> -----------------------------------------
>
> Key: SPARK-20179
> URL: https://issues.apache.org/jira/browse/SPARK-20179
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.1.0
> Environment: All
> Reporter: Cyril de Vogelaere
> Priority: Minor
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> The code I would like to push allows major performance improvements through
> the use of a CP-based solver (namely OscaR, an open-source solver). As you
> can see in the log-scale graph reachable from the link below, performance is
> at worst roughly the same, and at best up to 50x faster (FIFA dataset).
> Link to graph : http://i67.tinypic.com/t06lw7.jpg
> Link to implementation : https://github.com/Syrux/spark
> Link for datasets :
> http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the
> two slen datasets used are the first two in the list of 9)
> Performance was tested on the CECI servers, providing the driver with 10G of
> memory (more than needed) and running 4 slaves.
> In addition to the performance improvements, I also added a bunch of new
> functionalities :
> - Unlimited max pattern length (with input 0)
> - Min pattern length : any pattern shorter than that length won't be outputted
> - Max items per itemset : an itemset won't be grown beyond the input
> number, thus reducing the search space.
> - Head start : during the initial dataset cleaning, the frequent items were
> found and then discarded, which resulted in an inefficient first iteration of
> the genFreqPattern method. The algorithm now uses them when they are
> provided, and falls back to the empty pattern when they are not. A slight
> performance improvement was observed.
> - Sub-problem limit : when the resulting item sequences can be very long and
> the user disposes of a small number of very powerful machines, this parameter
> allows a quick switch to local execution, tremendously improving
> performance. Outside of those conditions, performance may be negatively
> affected.
> - Item constraints : allow the user to specify constraints on the occurrences
> of an item (=, >, <, >=, <=, !=). For users looking for specific results,
> the search space can be greatly reduced, which also improves performance.
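To make the item-constraint idea concrete, here is a minimal, self-contained Scala sketch (a hypothetical helper, not code from the actual patch) of how an occurrence constraint on an item could be checked against a candidate pattern:

```scala
// Hypothetical helper illustrating the item-constraint feature described
// above: count how often `item` occurs in a candidate pattern, then test
// that count against an (operator, bound) constraint such as (">=", 2).
def satisfies(pattern: Seq[Int], item: Int, op: String, bound: Int): Boolean = {
  val count = pattern.count(_ == item)
  op match {
    case "="  => count == bound
    case "!=" => count != bound
    case ">"  => count > bound
    case "<"  => count < bound
    case ">=" => count >= bound
    case "<=" => count <= bound
  }
}

// Example: the pattern <1 2 1> contains item 1 twice, so ">= 2" holds.
val ok = satisfies(Seq(1, 2, 1), 1, ">=", 2)
```

A prefix that already violates a monotone constraint (e.g. "<= 1" once an item's count reaches 2) need never be grown further, which is presumably where the search-space reduction comes from.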
> Please note that the aforementioned functionalities didn't come into play
> when testing performance. The performance shown on the graph is 'merely' the
> result of replacing the local execution with a CP-based algorithm.
> (maxLocalProjDBSize was also kept at its default value (32000000L), to make
> sure performance couldn't be artificially improved.)
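For reference, that baseline corresponds to the existing MLlib API with its defaults. A sketch of such a configuration, assuming a prepared `sequences: RDD[Array[Array[Int]]]` and example threshold values chosen for illustration:

```scala
import org.apache.spark.mllib.fpm.PrefixSpan

// Baseline setup as described above: stock PrefixSpan, with
// maxLocalProjDBSize left at its default of 32000000L.
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.1)                 // example support threshold (assumption)
  .setMaxPatternLength(10)            // example length cap (assumption)
  .setMaxLocalProjDBSize(32000000L)   // the default, per the ticket
val model = prefixSpan.run(sequences) // sequences: RDD[Array[Array[Int]]]
```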
> Additional notes :
> - The performance improvements are mostly seen on datasets where all
> itemsets are of size one, since that is a necessary condition for the use of
> the CP-based algorithm. But as you can see with the two Slen datasets,
> performance was also slightly improved for datasets with multiple items per
> itemset. The algorithm was built to detect when CP can be used, and to use
> it given the opportunity.
> - The performance displayed here is the result of six months of work.
> Various other things were tested to improve performance, without as much
> success. I can thus say with some confidence that the performance attained
> here will be very hard to improve further.
> - In case you want me to test performance on a specific dataset or to
> provide additional information, it would be my pleasure to do so :). Just
> hit me up at my email below :
> Email : [email protected]
> - I am a newbie contributor to Spark, and am not familiar with the whole
> procedure. In case I did something incorrectly, I will fix it as soon as
> possible.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)