[ 
https://issues.apache.org/jira/browse/SPARK-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cyril de Vogelaere updated SPARK-20179:
---------------------------------------
    Description: 
The code I would like to push brings a major performance improvement for 
single-item patterns through the use of a CP-based solver (namely OscaR, an 
open-source solver), and a slight performance improvement for multi-item 
patterns. As you can see in the log-scale graph reachable from the link below, 
performance is at worst roughly the same, and at best up to 50x faster (FIFA 
dataset).

Link to graph : http://i67.tinypic.com/t06lw7.jpg
Link to implementation : https://github.com/Syrux/spark
Link to datasets : 
http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (the 
two Slen datasets used are the first two in the list of 9)

Performance was tested on the CECI servers, giving the driver 10G of memory 
(more than needed) and running 4 slaves.

In addition to the performance improvements, I also added several new 
features (an illustrative configuration sketch follows the list below):

  - Unlimited max pattern length (with input 0): no performance improvement 
here, this simply allows an unlimited maximum pattern length.

  - Min pattern length: any pattern shorter than that length won't be output. 
No performance improvement, just a new feature.

  - Max items per itemset: an itemset won't be grown beyond the input number, 
thus reducing the search space.

  - Head start: during the initial dataset cleaning, the frequent items were 
found and then discarded, which resulted in an inefficient first iteration of 
the genFreqPattern method. The algorithm now uses them if they are provided, 
and falls back to the empty pattern when they're not. A slight performance 
improvement was measured.

  - Sub-problem limit: when the resulting sequences can be very long and the 
user has a small number of very powerful machines, this parameter allows an 
early switch to local execution, tremendously improving performance. Outside 
of those conditions, performance may be negatively affected.

  - Item constraints: allows the user to specify constraints on the 
occurrences of an item (=, >, <, >=, <=, !=). For users looking for specific 
results, the search space can be greatly reduced, which also improves 
performance.
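
A minimal configuration sketch of how these options might be set, assuming 
hypothetical setter names for the new parameters (only setMinSupport, 
setMaxPatternLength and setMaxLocalProjDBSize exist in the current MLlib API; 
the final names may differ):

{code:scala}
import org.apache.spark.mllib.fpm.PrefixSpan

// Illustrative only: setters marked "proposed" correspond to the new
// parameters described above and are NOT part of the current MLlib API.
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.1)
  .setMaxPatternLength(0)                  // proposed semantics: 0 = unlimited
  .setMinPatternLength(2)                  // proposed: drop patterns shorter than 2
  .setMaxItemPerItemSet(3)                 // proposed: cap itemset growth at 3 items
  .setSubProblemLimit(1000)                // proposed: switch to local execution early
  .setItemConstraints(Map(4 -> (">=", 2))) // proposed: item 4 must occur at least twice
{code}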


Please note that the aforementioned features didn't come into play when 
testing performance. The results shown on the graph are 'merely' the 
consequence of replacing the local execution with a CP-based algorithm. 
(maxLocalProjDBSize was also kept at its default value (32000000L), to make 
sure performance couldn't be artificially improved.)
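
For reference, the baseline runs used the existing MLlib API with its 
defaults, along these lines (assuming an existing SparkContext sc):

{code:scala}
import org.apache.spark.mllib.fpm.PrefixSpan

// Each sequence is an array of itemsets; each itemset is an array of items.
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()

val model = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setMaxLocalProjDBSize(32000000L) // default value, kept as-is for the benchmarks
  .run(sequences)

model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("(", ",", ")")).mkString("<", "", ">") + s": ${fs.freq}")
}
{code}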

Trade-offs:

  - The algorithm does have a slight trade-off: since it needs to detect 
whether the sequences can use the specialised single-item-pattern CP 
algorithm, it may be a bit slower when that algorithm is not needed. This 
trade-off was mitigated by introducing a round of sequence cleaning before 
the local execution, which improves the performance of multi-item local 
executions when the cleaning is effective. When no item can be cleaned, the 
cost of the check shows up as a slight performance drop (see the sketch 
after this list).

  - None of the other changes should have any effect on efficiency or 
complexity when left at their default values (where they are essentially 
deactivated). When activated, they may however reduce the search space and 
thus improve performance.
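
A sketch of the two mechanisms behind the first trade-off above; this is 
illustrative code, not the patch's exact implementation (both function names 
are made up for the example):

{code:scala}
// The specialised CP algorithm applies only when every itemset of every
// sequence in the (projected) database is a singleton.
def canUseSingleItemCP(db: Array[Array[Array[Int]]]): Boolean =
  db.forall(_.forall(_.length == 1))

// Cleaning before local execution: drop items that are no longer frequent in
// the projected database. Multi-item itemsets may shrink to singletons,
// letting more sub-problems qualify for the CP algorithm.
def cleanSequences(db: Array[Array[Array[Int]]], minCount: Long): Array[Array[Array[Int]]] = {
  // support = number of sequences containing the item (counted once per sequence)
  val support = db.flatMap(_.flatten.distinct)
    .groupBy(identity)
    .map { case (item, occs) => item -> occs.length }
  db.map(seq => seq.map(_.filter(item => support(item) >= minCount)).filter(_.nonEmpty))
}
{code}

When the cleaning removes nothing, the support count is pure overhead, which 
is the slight drop described above.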

Additional notes:

 - The performance improvements are mostly seen for datasets where all 
itemsets are of size one, since that is a necessary condition for the use of 
the CP-based algorithm. But as you can see with the two Slen datasets, 
performance was also slightly improved for datasets with multiple items per 
itemset. The algorithm was built to detect when CP can be used, and to use 
it given the opportunity.

 - The performance displayed here is the result of six months of work. 
Various other approaches were tried to improve performance, without as much 
success. I can thus say with some confidence that the performance attained 
here will be very hard to improve further.

 - If you want me to test performance on a specific dataset or to provide 
additional information, it would be my pleasure to do so :). Just hit me up 
at my email below:

Email : cyril.devogela...@gmail.com

 - I am a new contributor to Spark and am not familiar with the whole 
procedure. If I did something incorrectly, I will fix it as soon as possible.


> Major improvements to Spark's Prefix span
> -----------------------------------------
>
>                 Key: SPARK-20179
>                 URL: https://issues.apache.org/jira/browse/SPARK-20179
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>         Environment: All
>            Reporter: Cyril de Vogelaere
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
