I looked into this after I opened that JIRA, and it’s actually a bit harder to fix than it looks. Changing these visit() calls to use an explicit stack instead of recursion does avoid the StackOverflowError in DAGScheduler, but you still get a StackOverflowError when the task is sent to a worker node, because Java serialization is itself recursive. With the current codebase, the only real fix is therefore to increase your JVM stack size.

Longer-term, I’d like us to call checkpoint() automatically to break lineage graphs when they exceed a certain size, which would avoid the problem in both DAGScheduler and Java serialization. We could also add this to ALS manually now, without having a solution for other programs. That would be a great change to make to fix this JIRA.

Matei

On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava <m...@ewencp.org> wrote:
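
A minimal sketch of the behavior described in the reply above, outside of Spark. Everything here is hypothetical and for illustration only: `Node` stands in for an RDD with a single parent, `truncate` only mimics the lineage-cutting effect of checkpoint(), and the depth and stack sizes are arbitrary assumptions. It is not Spark's DAGScheduler code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.Deque;

class LineageDepthDemo {
    // Hypothetical stand-in for an RDD that references one parent; not a Spark class.
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        Node parent;
        Node(Node parent) { this.parent = parent; }
    }

    // Build a chain of `depth` nodes in a loop and return the newest one.
    static Node buildChain(int depth) {
        Node n = null;
        for (int i = 0; i < depth; i++) n = new Node(n);
        return n;
    }

    // Walk the lineage with an explicit stack, so depth is bounded by heap, not thread stack.
    static int countIteratively(Node leaf) {
        Deque<Node> stack = new ArrayDeque<>();
        if (leaf != null) stack.push(leaf);
        int visited = 0;
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            visited++;
            if (n.parent != null) stack.push(n.parent);
        }
        return visited;
    }

    // Default Java serialization follows object references recursively.
    // Returns false if that recursion overflows the current thread's stack.
    static boolean serializes(Node leaf) {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
            out.writeObject(leaf);
            return true;
        } catch (StackOverflowError e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Same attempt on a thread with a requested stack size. The JVM treats the
    // size as a hint, so the effect is platform dependent.
    static boolean serializesWithStack(Node leaf, long stackBytes) throws InterruptedException {
        boolean[] ok = new boolean[1];
        Thread t = new Thread(null, () -> ok[0] = serializes(leaf), "big-stack", stackBytes);
        t.start();
        t.join();
        return ok[0];
    }

    // Drop the reference to the parent, mimicking how a checkpoint cuts lineage.
    static void truncate(Node leaf) {
        leaf.parent = null;
    }

    public static void main(String[] args) throws InterruptedException {
        Node leaf = buildChain(100_000);
        System.out.println("iterative walk visited: " + countIteratively(leaf));
        System.out.println("serializes on default stack: " + serializes(leaf));
        System.out.println("serializes on 512 MiB stack: " + serializesWithStack(leaf, 512L << 20));
        truncate(leaf);
        System.out.println("serializes after truncation: " + serializes(leaf));
    }
}
```

The iterative walk handles the deep chain regardless of stack size, while whether the serialization attempts succeed depends on the JVM's stack settings (the process-wide equivalent of the per-thread size is the -Xss option). Once the chain is truncated, serialization only has one object to write, which is the effect the proposed automatic checkpoint() would have on long lineage graphs.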