s/underestimated/overestimated/

On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote:
> Marco,
>
> There is understanding how Spark works, and there is finding bugs early in their own program. One can perfectly understand how Spark works and still find it valuable to get feedback asap, and that's why we built eager analysis in the first place.
>
> Also I'm afraid you've significantly underestimated the level of technical sophistication of users. In many cases they struggle to get anything to work, and performance optimization of their programs is secondary to getting things working. As John Ousterhout says, "the greatest performance improvement of all is when a system goes from not-working to working".
>
> I really like Ryan's approach. Would be great if it were something more turn-key.
>
>
> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com> wrote:
>
>> I am not sure how useful this is. For students, it is important to understand how Spark works. This can be critical in many decisions they have to take (whether and what to cache, for instance) in order to write performant Spark applications. An eager execution mode can probably help them get something running more easily, but it also lets them use Spark while knowing less about how it works, so they are likely to write worse applications and to have more trouble debugging any problem that occurs later (in production), which in turn affects their experience with the tool.
>>
>> Moreover, as Ryan also mentioned, there are tools/ways to force execution that help in the debugging phase. So they can achieve the same result without much effort, but with a big difference: they are aware of what is really happening, which may help them later.
>>
>> Thanks,
>> Marco
>>
>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>>
>>> At Netflix, we use Jupyter notebooks and consoles for interactive sessions. For anyone interested, this mode of interaction is really easy to add in Jupyter and PySpark. You would just define a different _repr_html_ or __repr__ method for Dataset that runs a take(10) or take(100) and formats the result (a sketch is appended at the end of this thread).
>>>
>>> That way, the output of a cell or console execution always causes the dataframe to run and get displayed for that immediate feedback. But there is no change to Spark's behavior, because the action is run by the REPL, and only when a dataframe is the result of an execution, in order to display it. Intermediate results wouldn't be run, which gives users a way to avoid too many executions and still supports method chaining in the dataframe API (which would be horrible with an aggressive execution model).
>>>
>>> There are ways to do this in JVM languages as well if you are using a Scala or Java interpreter (see jvm-repr <https://github.com/jupyter/jvm-repr>). This is actually what we do in our Spark-based SQL interpreter to display results.
>>>
>>> rb
>>>
>>>
>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> yeah, we run into this all the time with new hires. they will send emails explaining there is an error in the .write operation, and they are debugging the writing to disk, focusing on that piece of code :)
>>>>
>>>> unrelated, but another frequent cause of confusion is cascading errors, like the FetchFailedException. they will be debugging the reducer task, not realizing that the error happened before that and that the FetchFailedException is not the root cause.
>>>>
>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> Similar to the thread yesterday about improving ML/DL integration, I'm sending another email on what I've learned recently from Spark users. I recently talked to some educators who have been teaching Spark in their (top-tier) university classes. They are some of the most important users for adoption because of the multiplicative effect they have on the future generation.
>>>>>
>>>>> To my surprise, the single biggest ask they have is to enable an eager execution mode on all operations, for teaching and debuggability:
>>>>>
>>>>> (1) Most of the students are relatively new to programming, and they need multiple iterations to get even the most basic operation right. In these cases, in order to trigger an error, they would need to explicitly add actions, which is non-intuitive.
>>>>>
>>>>> (2) If they don't add explicit actions to every operation and there is a mistake, the error pops up somewhere later, where an action is triggered. This is in a different place from the code that causes the problem, and it is difficult for students to correlate the two (a small example is appended at the end of this thread).
>>>>>
>>>>> I suspect a lot of Spark users in the real world also struggle in similar ways to these students. While eager execution is really not practical in big data, in learning environments or in development against small, sampled datasets it can be pretty helpful.
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>>
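
A minimal sketch of the Jupyter/PySpark _repr_html_ approach Ryan describes above, assuming a notebook session. Monkey-patching DataFrame, the take(10) limit, and the bare-bones HTML table are illustrative choices for the example, not an established API:

    # Sketch only: patch DataFrame so Jupyter displays the first rows eagerly.
    from pyspark.sql import DataFrame, SparkSession

    def _df_repr_html_(df, num_rows=10):
        """Run take(num_rows) -- the action -- and render a small HTML table."""
        rows = df.take(num_rows)
        header = "".join("<th>%s</th>" % c for c in df.columns)
        body = "".join(
            "<tr>" + "".join("<td>%s</td>" % v for v in row) + "</tr>"
            for row in rows
        )
        return "<table><tr>%s</tr>%s</table>" % (header, body)

    # Jupyter calls _repr_html_ only on the value a cell actually returns.
    DataFrame._repr_html_ = _df_repr_html_

    spark = SparkSession.builder.getOrCreate()
    spark.range(5)  # as the last expression in a notebook cell, this now runs and displays

Because the REPL only calls _repr_html_ on the value a cell returns, intermediate dataframes and chained calls stay lazy, which matches the behavior Ryan describes.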
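
And a small example of the failure mode described in points (1) and (2) above, using a hypothetical string-parsing UDF; the transformation builds without error, and the exception only surfaces once an explicit action runs, potentially far from the line that introduced the bug:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("1",), ("2",), ("oops",)], ["raw"])

    parse_int = udf(lambda s: int(s), IntegerType())         # raises on "oops"
    parsed = df.withColumn("value", parse_int(col("raw")))   # no error yet: lazy

    # ...many more transformations may follow...

    parsed.show()  # the ValueError from int("oops") only appears at this action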