[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362986#comment-14362986 ]
Martin Zapletal edited comment on SPARK-3278 at 3/16/15 9:37 AM:
-----------------------------------------------------------------
Vladimir, just to update you on the progress. I was able to complete the isotonic regression with 100M records (similar data to what you requested, using NASDAQ index prices), but failed with an insufficient-memory error at 150M records on my machine. You may be able to run with larger amounts of data on better machines.

The pool adjacent violators algorithm can theoretically run in linear time, but although I have used the best algorithm I could find, I am not convinced this implementation reaches that efficiency. I will work on providing evidence.

The biggest issue with the current algorithm, however, is the parallelization approach. Its properties are unfortunately nowhere near linear scalability (linear solution-time increase with linear parallelism increase, or constant solution time with linear parallelism increase and linear problem-size increase). This was expected and is caused by the algorithm itself, for the following reasons:

1) The algorithm works in two steps. First the computation is distributed to all partitions, but the results are then gathered and the algorithm is run again on the whole data set. This approach may leave most of the work for the last sequential step, and thus gain very little over a purely sequential implementation, or even perform worse. That can happen when the partition-local isotonic regressions return locally optimal solutions that must still change to reach the global solution in the last step. Another performance drawback compared to sequential processing is the potential need to copy data to each process.

2) It requires the whole dataset to fit into one process' memory in the last step (or repeated disk access).

I started looking into the issue and was able to design an iterative algorithm that addressed both of the above issues and performed very close to linear scalability.
It still has correctness (rounding) issues, however, and will require further research. Let me know if that helped. In the meantime I will continue working on benchmarks and performance quantification of the current algorithm, as well as on research into potentially more efficient solutions.
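To make the two-step scheme discussed above concrete, here is a minimal, self-contained Scala sketch. The names (`PavSketch`, `isotonic`, `twoStep`) are illustrative, not MLlib's API, and the code runs on in-memory arrays rather than RDDs: a sequential pool-adjacent-violators pass, plus the distribute-then-gather step in which each partition is solved locally and the algorithm is run once more over the concatenated partial results.

```scala
// Minimal pool-adjacent-violators (PAV) sketch; names are illustrative only.
object PavSketch {

  // Sequential PAV: returns the isotonic (non-decreasing) weighted
  // least-squares fit to ys with weights ws, one fitted value per point.
  def isotonic(ys: Array[Double], ws: Array[Double]): Array[Double] = {
    // A block is a pooled run of points: (weighted sum, total weight).
    case class Block(sum: Double, weight: Double) {
      def mean: Double = sum / weight
      def merge(o: Block): Block = Block(sum + o.sum, weight + o.weight)
    }
    // Stack of (block, number of points pooled into it).
    val stack = scala.collection.mutable.ArrayBuffer.empty[(Block, Int)]
    for (i <- ys.indices) {
      var blk = Block(ys(i) * ws(i), ws(i))
      var n = 1
      // Pool with the previous block while monotonicity is violated.
      while (stack.nonEmpty && stack.last._1.mean > blk.mean) {
        val (prev, pn) = stack.remove(stack.length - 1)
        blk = prev.merge(blk)
        n += pn
      }
      stack += ((blk, n))
    }
    // Expand each block's mean back to one value per original point.
    stack.iterator.flatMap { case (b, n) => Iterator.fill(n)(b.mean) }.toArray
  }

  // The two-step approach described in the comment: run PAV on each
  // partition independently, then once more over the concatenated results.
  def twoStep(parts: Seq[(Array[Double], Array[Double])]): Array[Double] = {
    val (ys, ws) = parts.map { case (y, w) => (isotonic(y, w), w) }.unzip
    isotonic(ys.flatten.toArray, ws.flatten.toArray)
  }
}
```

Note that `twoStep` illustrates the structural problem described above: if the input is far from monotonic, the locally pooled blocks still violate each other at partition boundaries, so the final sequential pass over the full concatenation does most of the pooling work. The floating-point block means also hint at where rounding issues can enter an iterative variant.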
> Isotonic regression
> -------------------
>
>                 Key: SPARK-3278
>                 URL: https://issues.apache.org/jira/browse/SPARK-3278
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Martin Zapletal
>             Fix For: 1.3.0
>
> Add isotonic regression for score calibration.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)