s/underestimated/overestimated/

On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote:
> Marco,
>
> There is understanding how Spark works, and there is finding bugs early in their own program. One can perfectly understand how Spark works and still find it valuable to get feedback asap, and that's why we built eager analysis in the first place.
>
> Also I'm afraid you've significantly underestimated the level of technical sophistication of users. In many cases they struggle to get anything to work, and performance optimization of their programs is secondary to getting things working. As John Ousterhout says, "the greatest performance improvement of all is when a system goes from not-working to working".
>
> I really like Ryan's approach. Would be great if it were something more turn-key.
>
>
> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com> wrote:
>
>> I am not sure how useful this is. For students, it is important to understand how Spark works. This can be critical in many decisions they have to take (whether and what to cache, for instance) in order to write performant Spark applications. An eager execution mode can probably help them get something running more easily, but it also lets them use Spark while knowing less about how it works, so they are likely to write worse applications and to have more trouble debugging any problem that occurs later (in production), which in turn affects their experience with the tool.
>>
>> Moreover, as Ryan also mentioned, there are tools/ways to force execution that help in the debugging phase. So they can achieve the same result without much effort, but with a big difference: they are aware of what is really happening, which may help them later.
>>
>> Thanks,
>> Marco
>>
>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>>
>>> At Netflix, we use Jupyter notebooks and consoles for interactive sessions. For anyone interested, this mode of interaction is really easy to add in Jupyter and PySpark. You would just define a different _repr_html_ or __repr__ method for Dataset that runs a take(10) or take(100) and formats the result (a sketch is appended at the end of this thread).
>>>
>>> That way, the output of a cell or console execution always causes the dataframe to run and get displayed for that immediate feedback. But there is no change to Spark's behavior, because the action is run by the REPL, and only when a dataframe is the result of an execution, in order to display it. Intermediate results wouldn't be run, which gives users a way to avoid too many executions and still supports method chaining in the dataframe API (which would be horrible with an aggressive execution model).
>>>
>>> There are ways to do this in JVM languages as well if you are using a Scala or Java interpreter (see jvm-repr <https://github.com/jupyter/jvm-repr>). This is actually what we do in our Spark-based SQL interpreter to display results.
>>>
>>> rb
>>>
>>>
>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> yeah, we run into this all the time with new hires. they will send emails explaining there is an error in the .write operation, and they are debugging the writing to disk, focusing on that piece of code :)
>>>>
>>>> unrelated, but another frequent cause of confusion is cascading errors, like the FetchFailedException. they will be debugging the reducer task, not realizing that the error happened before that and that the FetchFailedException is not the root cause.
>>>>
>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> Similar to the thread yesterday about improving ML/DL integration, I'm sending another email on what I've learned recently from Spark users. I recently talked to some educators who have been teaching Spark in their (top-tier) university classes. They are some of the most important users for adoption because of the multiplicative effect they have on the future generation.
>>>>>
>>>>> To my surprise, the single biggest ask they have is to enable an eager execution mode on all operations, for teaching and debuggability:
>>>>>
>>>>> (1) Most of the students are relatively new to programming, and they need multiple iterations to get even the most basic operation right. In these cases, in order to trigger an error, they would need to explicitly add actions, which is non-intuitive.
>>>>>
>>>>> (2) If they don't add explicit actions to every operation and there is a mistake, the error pops up somewhere later, where an action is triggered. This is in a different place from the code that causes the problem, and it is difficult for students to correlate the two (a small example is appended at the end of this thread).
>>>>>
>>>>> I suspect a lot of Spark users in the real world also struggle in similar ways to these students. While eager execution is really not practical in big data, in learning environments or in development against small, sampled datasets it can be pretty helpful.
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>>
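
A minimal sketch of the Jupyter/PySpark _repr_html_ approach Ryan describes above, assuming a notebook session. Monkey-patching DataFrame, the take(10) limit, and the bare-bones HTML table are illustrative choices for the example, not an established API:

    # Sketch only: patch DataFrame so Jupyter displays the first rows eagerly.
    from pyspark.sql import DataFrame, SparkSession

    def _df_repr_html_(df, num_rows=10):
        """Run take(num_rows) -- the action -- and render a small HTML table."""
        rows = df.take(num_rows)
        header = "".join("<th>%s</th>" % c for c in df.columns)
        body = "".join(
            "<tr>" + "".join("<td>%s</td>" % v for v in row) + "</tr>"
            for row in rows
        )
        return "<table><tr>%s</tr>%s</table>" % (header, body)

    # Jupyter calls _repr_html_ only on the value a cell actually returns.
    DataFrame._repr_html_ = _df_repr_html_

    spark = SparkSession.builder.getOrCreate()
    spark.range(5)  # as the last expression in a notebook cell, this now runs and displays

Because the REPL only calls _repr_html_ on the value a cell returns, intermediate dataframes and chained calls stay lazy, which matches the behavior Ryan describes.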
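
And a small example of the failure mode described in points (1) and (2) above, using a hypothetical string-parsing UDF; the transformation builds without error, and the exception only surfaces once an explicit action runs, potentially far from the line that introduced the bug:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("1",), ("2",), ("oops",)], ["raw"])

    parse_int = udf(lambda s: int(s), IntegerType())         # raises on "oops"
    parsed = df.withColumn("value", parse_int(col("raw")))   # no error yet: lazy

    # ...many more transformations may follow...

    parsed.show()  # the ValueError from int("oops") only appears at this action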