Re: Make proactive check for closure serializability optional?

2019-01-21 Thread Felix Cheung
Agreed on the pros/cons, especially that the driver could be the data
science notebook. Is it worthwhile making it configurable?



From: Sean Owen 
Sent: Monday, January 21, 2019 10:42 AM
To: Reynold Xin
Cc: dev
Subject: Re: Make proactive check for closure serializability optional?

None except the bug / PR I linked to, which is really just a bug in
the RowMatrix implementation; a 2GB closure isn't reasonable.
I doubt it's much overhead in the common case, because closures are
small, and this extra check happens only once, when the closure is
cleaned on the driver, not on every execution.

I can also imagine middle-ground cases where people are dragging along
largish 10MB closures (like a model, or some data), and this could add
non-trivial memory pressure on the driver. They should be broadcasting
those things, sure.

Given just that I'd leave it alone, but was wondering if anyone had
ever had the same thought, or more arguments that it should be
disable-able. In 'production' one would imagine all the closures do
serialize correctly, and so this is just a bit of overhead that could
be skipped.

On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin  wrote:
>
> Did you actually observe a perf issue?
>
> On Mon, Jan 21, 2019 at 10:04 AM Sean Owen  wrote:
>>
>> The ClosureCleaner proactively checks that closures passed to
>> transformations like RDD.map() are serializable, before they're
>> executed. It does this by just serializing it with the JavaSerializer.
>>
>> That's a nice feature, although there's overhead in always trying to
>> serialize the closure ahead of time, especially if the closure is
>> large. It shouldn't be large, usually. But I noticed it when coming up
>> with this fix: https://github.com/apache/spark/pull/23600
>>
>> It made me wonder, should this be optional, or even not the default?
>> Closures that don't serialize still fail, just later when an action is
>> invoked. I don't feel strongly about it, just checking if anyone had
>> pondered this before.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>




Re: Make proactive check for closure serializability optional?

2019-01-21 Thread Sean Owen
None except the bug / PR I linked to, which is really just a bug in
the RowMatrix implementation; a 2GB closure isn't reasonable.
I doubt it's much overhead in the common case, because closures are
small, and this extra check happens only once, when the closure is
cleaned on the driver, not on every execution.

I can also imagine middle-ground cases where people are dragging along
largish 10MB closures (like a model, or some data), and this could add
non-trivial memory pressure on the driver. They should be broadcasting
those things, sure.

Given just that I'd leave it alone, but was wondering if anyone had
ever had the same thought, or more arguments that it should be
disable-able. In 'production' one would imagine all the closures do
serialize correctly, and so this is just a bit of overhead that could
be skipped.
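The memory-pressure point can be illustrated with plain JVM
serialization, independent of Spark (a standalone sketch with made-up
names, not ClosureCleaner's actual code): a closure that captures a
large array serializes to roughly the array's size, so an up-front
check materializes all of those bytes on the driver.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntUnaryOperator;

// Sketch: measure how large a closure's serialized form is. A closure
// capturing a 10MB array costs ~10MB to check; one capturing only an
// int stays tiny. Plain JVM serialization; this is not Spark code.
public class ClosureSizeSketch {
    static long serializedSize(Object closure) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(closure);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws Exception {
        byte[] model = new byte[10 * 1024 * 1024]; // a 10MB "model"

        // Lambdas only serialize via an intersection cast with
        // Serializable; captured state is serialized along with them.
        IntUnaryOperator heavy =
            (IntUnaryOperator & Serializable) x -> x + model.length;
        IntUnaryOperator light =
            (IntUnaryOperator & Serializable) x -> x + 1;

        System.out.println("heavy >= 10MB: "
            + (serializedSize(heavy) >= model.length));
        System.out.println("light < 10KB: "
            + (serializedSize(light) < 10_000));
    }
}
```

Broadcasting the large object instead of capturing it keeps the
per-task closure in the "light" case above.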

On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin  wrote:
>
> Did you actually observe a perf issue?
>
> On Mon, Jan 21, 2019 at 10:04 AM Sean Owen  wrote:
>>
>> The ClosureCleaner proactively checks that closures passed to
>> transformations like RDD.map() are serializable, before they're
>> executed. It does this by just serializing it with the JavaSerializer.
>>
>> That's a nice feature, although there's overhead in always trying to
>> serialize the closure ahead of time, especially if the closure is
>> large. It shouldn't be large, usually. But I noticed it when coming up
>> with this fix: https://github.com/apache/spark/pull/23600
>>
>> It made me wonder, should this be optional, or even not the default?
>> Closures that don't serialize still fail, just later when an action is
>> invoked. I don't feel strongly about it, just checking if anyone had
>> pondered this before.
>>
>>




Re: Make proactive check for closure serializability optional?

2019-01-21 Thread Reynold Xin
Did you actually observe a perf issue?

On Mon, Jan 21, 2019 at 10:04 AM Sean Owen  wrote:

> The ClosureCleaner proactively checks that closures passed to
> transformations like RDD.map() are serializable, before they're
> executed. It does this by just serializing it with the JavaSerializer.
>
> That's a nice feature, although there's overhead in always trying to
> serialize the closure ahead of time, especially if the closure is
> large. It shouldn't be large, usually. But I noticed it when coming up
> with this fix: https://github.com/apache/spark/pull/23600
>
> It made me wonder, should this be optional, or even not the default?
> Closures that don't serialize still fail, just later when an action is
> invoked. I don't feel strongly about it, just checking if anyone had
> pondered this before.
>
>
>


Make proactive check for closure serializability optional?

2019-01-21 Thread Sean Owen
The ClosureCleaner proactively checks that closures passed to
transformations like RDD.map() are serializable, before they're
executed. It does this by just serializing it with the JavaSerializer.

That's a nice feature, although there's overhead in always trying to
serialize the closure ahead of time, especially if the closure is
large. It shouldn't be large, usually. But I noticed it when coming up
with this fix: https://github.com/apache/spark/pull/23600

It made me wonder, should this be optional, or even not the default?
Closures that don't serialize still fail, just later when an action is
invoked. I don't feel strongly about it, just checking if anyone had
pondered this before.
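For reference, the idea behind the proactive check can be sketched in
plain Java (a hypothetical `ensureSerializable` helper, not the actual
ClosureCleaner implementation): serialize the closure once, up front,
so a bad capture fails on the driver instead of later, when an action
ships tasks.

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntUnaryOperator;

public class ProactiveCheckSketch {
    // Attempt a throwaway serialization; throws if the closure (or
    // anything it captures) can't be serialized.
    static void ensureSerializable(Object closure) throws Exception {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(closure);
        }
    }

    public static void main(String[] args) throws Exception {
        // Captures only an int: passes the check.
        int offset = 1;
        IntUnaryOperator ok =
            (IntUnaryOperator & Serializable) x -> x + offset;
        ensureSerializable(ok);
        System.out.println("ok closure passed");

        // Captures a Thread, which isn't Serializable: the check
        // throws immediately rather than at first use.
        Thread handle = new Thread();
        IntUnaryOperator bad =
            (IntUnaryOperator & Serializable) x -> x + handle.hashCode();
        try {
            ensureSerializable(bad);
            System.out.println("bad closure passed");
        } catch (NotSerializableException e) {
            System.out.println("bad closure failed fast");
        }
    }
}
```

Without the up-front check, the second closure would only fail once an
action tried to serialize it for execution.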
