We have had trouble gathering status on long-running jobs.  We have
attempted to map the Spark UI/Monitoring API output back to our code
base, but because of the separation between our code and the execution
plan, even guessing where we are in the process is difficult.  The
Job/Stage/Task information is too far abstracted from our code to be easily
digested by the non-Spark engineers on our team.
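
For context, the closest thing we have wired up so far is a listener along
these lines (a minimal sketch using the standard SparkListener interface;
the app name and messages are just illustration).  It surfaces the same
Job/Stage/Task view as the UI, and the stage names it reports come from the
plan and call sites rather than from anything our engineers would recognize:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("status-listener").getOrCreate()

    // Log what the listener bus exposes today.  The stage "name" is derived
    // from the physical plan / call site, which is what makes it hard to
    // relate back to our own code.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        println(s"[status] stage ${stage.stageInfo.stageId} done: ${stage.stageInfo.name}")

      override def onJobEnd(job: SparkListenerJobEnd): Unit =
        println(s"[status] job ${job.jobId} finished: ${job.jobResult}")
    })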

Is there a "hook" to which I can attach a piece of code that is triggered
when a point in the plan is reached?  This could be when a SQL command
completes, or when a new DataSet is created, anything really...
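
To make the request concrete, here is a purely hypothetical sketch of the
shape I have in mind.  None of these names exist in Spark as far as I know;
they are invented only to illustrate the ask:

    // Hypothetical only: a side-effecting callback fired when execution
    // reaches a labelled point in the plan, with no access to the data.
    trait PlanPointHook {
      def onPlanPoint(label: String, timestampMs: Long): Unit
    }

    // Example implementation: write an independent status line somewhere
    // external, completely separate from the Datasets being processed.
    object StdoutStatusHook extends PlanPointHook {
      def onPlanPoint(label: String, timestampMs: Long): Unit =
        println(s"[status] reached '$label' at $timestampMs")
    }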

It seems Dataset.checkpoint() offers an excellent snapshot position during
execution, but I'm concerned that it short-circuits the optimal execution of
the full plan.  I really want these trigger functions to be completely
independent of the actual processing itself.  I'm not looking to extract
information from a Dataset, RDD, or anything else.  I essentially want to
write independent output for status.
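
For reference, this is roughly the pattern I've been experimenting with (a
sketch only; the paths and table names are made up).  The eager checkpoint
materializes the Dataset at that point, so I know the step completed, but
it also cuts the lineage and forces a write, which is exactly the
short-circuiting of the full plan that worries me:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checkpoint-status").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/status-checkpoints")  // made-up path

    val orders    = spark.read.parquet("/data/orders")              // made-up inputs
    val customers = spark.read.parquet("/data/customers")
    val enriched  = orders.join(customers, "customer_id")

    // Eager checkpoint: materializes 'enriched' here so we can report
    // progress, at the cost of breaking up the end-to-end plan.
    val checkpointed = enriched.checkpoint()
    println(s"[status] enrichment complete at ${System.currentTimeMillis()}")

    checkpointed.write.parquet("/data/enriched")                    // made-up output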

If this doesn't exist, is there any desire on the dev team for me to
investigate this feature?

Thank you for any and all help.
