Re: Make proactive check for closure serializability optional?
Agreed on the pros and cons, especially that the driver could be the data science notebook. Is it worthwhile making it configurable?

From: Sean Owen
Sent: Monday, January 21, 2019 10:42 AM
To: Reynold Xin
Cc: dev
Subject: Re: Make proactive check for closure serializability optional?

None except the bug / PR I linked to, which is really just a bug in the RowMatrix implementation; a 2GB closure isn't reasonable. I doubt it's much overhead in the common case, because closures are small and this extra check happens once per execution of the closure. I can also imagine middle-ground cases where people are dragging along largish ~10MB closures (like a model or some data), and this could add non-trivial memory pressure on the driver. They should be broadcasting those things, sure. Given just that, I'd leave it alone, but I was wondering if anyone had ever had the same thought, or more arguments that it should be disable-able. In 'production' one would imagine all the closures do serialize correctly, so this is just a bit of overhead that could be skipped.

On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin wrote:
>
> Did you actually observe a perf issue?
>
> On Mon, Jan 21, 2019 at 10:04 AM Sean Owen wrote:
>>
>> The ClosureCleaner proactively checks that closures passed to
>> transformations like RDD.map() are serializable, before they're
>> executed. It does this by just serializing it with the JavaSerializer.
>>
>> That's a nice feature, although there's overhead in always trying to
>> serialize the closure ahead of time, especially if the closure is
>> large. It shouldn't be large, usually. But I noticed it when coming up
>> with this fix: https://github.com/apache/spark/pull/23600
>>
>> It made me wonder: should this be optional, or even not the default?
>> Closures that don't serialize still fail, just later, when an action is
>> invoked. I don't feel strongly about it; just checking if anyone had
>> pondered this before.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Make proactive check for closure serializability optional?
None except the bug / PR I linked to, which is really just a bug in the RowMatrix implementation; a 2GB closure isn't reasonable. I doubt it's much overhead in the common case, because closures are small and this extra check happens once per execution of the closure. I can also imagine middle-ground cases where people are dragging along largish ~10MB closures (like a model or some data), and this could add non-trivial memory pressure on the driver. They should be broadcasting those things, sure. Given just that, I'd leave it alone, but I was wondering if anyone had ever had the same thought, or more arguments that it should be disable-able. In 'production' one would imagine all the closures do serialize correctly, so this is just a bit of overhead that could be skipped.

On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin wrote:
>
> Did you actually observe a perf issue?
>
> On Mon, Jan 21, 2019 at 10:04 AM Sean Owen wrote:
>>
>> The ClosureCleaner proactively checks that closures passed to
>> transformations like RDD.map() are serializable, before they're
>> executed. It does this by just serializing it with the JavaSerializer.
>>
>> That's a nice feature, although there's overhead in always trying to
>> serialize the closure ahead of time, especially if the closure is
>> large. It shouldn't be large, usually. But I noticed it when coming up
>> with this fix: https://github.com/apache/spark/pull/23600
>>
>> It made me wonder: should this be optional, or even not the default?
>> Closures that don't serialize still fail, just later, when an action is
>> invoked. I don't feel strongly about it; just checking if anyone had
>> pondered this before.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
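[Editor's note] The memory-pressure concern above can be illustrated outside Spark. This is a plain-Java sketch, not Spark's actual ClosureCleaner or JavaSerializer code; the class and interface names (`ClosureSize`, `SerFn`) are made up for illustration. It serializes a lambda that captures a ~10 MB array, the same way a proactive check would, showing the extra bytes and CPU the check pays for when users capture data instead of broadcasting it.

```java
import java.io.*;

public class ClosureSize {
    // For a Java lambda to be serializable, its functional interface
    // must extend Serializable.
    interface SerFn extends Serializable { double apply(int i); }

    // Serialize the closure the way a proactive check would,
    // and report how many bytes it drags along.
    static int serializedSize(Object closure) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(closure);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // A ~10 MB "model" captured by the closure instead of being broadcast.
        double[] model = new double[1_250_000];
        SerFn fn = i -> model[i % model.length];
        System.out.println("serialized closure size: "
                + serializedSize(fn) + " bytes");
    }
}
```

Every proactive check on such a closure re-serializes the whole captured array, which is the "non-trivial memory pressure on the driver" being described.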
Re: Make proactive check for closure serializability optional?
Did you actually observe a perf issue?

On Mon, Jan 21, 2019 at 10:04 AM Sean Owen wrote:
> The ClosureCleaner proactively checks that closures passed to
> transformations like RDD.map() are serializable, before they're
> executed. It does this by just serializing it with the JavaSerializer.
>
> That's a nice feature, although there's overhead in always trying to
> serialize the closure ahead of time, especially if the closure is
> large. It shouldn't be large, usually. But I noticed it when coming up
> with this fix: https://github.com/apache/spark/pull/23600
>
> It made me wonder, should this be optional, or even not the default?
> Closures that don't serialize still fail, just later when an action is
> invoked. I don't feel strongly about it, just checking if anyone had
> pondered this before.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Make proactive check for closure serializability optional?
The ClosureCleaner proactively checks that closures passed to transformations like RDD.map() are serializable, before they're executed. It does this by just serializing them with the JavaSerializer.

That's a nice feature, although there's overhead in always trying to serialize the closure ahead of time, especially if the closure is large. It shouldn't be large, usually. But I noticed it when coming up with this fix: https://github.com/apache/spark/pull/23600

It made me wonder: should this be optional, or even not the default? Closures that don't serialize still fail, just later, when an action is invoked. I don't feel strongly about it; just checking if anyone had pondered this before.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
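[Editor's note] The proactive check described above amounts to "serialize the closure now and see if it throws." Here is a minimal plain-Java sketch of that idea; it is an illustration, not Spark's ClosureCleaner code, and the names (`ClosureCheck`, `trySerialize`) are hypothetical. A closure capturing a non-serializable object fails immediately at the check, rather than later when a task runs it.

```java
import java.io.*;

public class ClosureCheck {
    // Serializable functional interface, so lambdas assigned to it
    // can be written with Java serialization.
    interface SerFn extends Serializable { int apply(int x); }

    // The proactive check: serialize the closure up front.
    // Throws NotSerializableException (an IOException) if it can't.
    static byte[] trySerialize(Object closure) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(closure);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        int offset = 10;
        SerFn ok = x -> x + offset;     // captures only an int: serializes fine
        System.out.println("ok closure: " + trySerialize(ok).length + " bytes");

        Thread t = new Thread();        // Thread is not Serializable
        SerFn bad = x -> x + t.hashCode();
        try {
            trySerialize(bad);
        } catch (NotSerializableException e) {
            // Fails here, at "check time", not later at "action time".
            System.out.println("caught early: " + e.getMessage());
        }
    }
}
```

Skipping the check wouldn't change which closures fail, only when: the same NotSerializableException would surface later, once an action triggers task serialization.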