Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

Josh Rosen Fri, 24 Mar 2017 12:08:44 -0700

I think that it should be safe to remove JobProgressListener but I'd like
to keep the SparkStatusTracker API.

SparkStatusTracker was originally developed to provide a stable
programmatic status API for use by Hive on Spark. SparkStatusTracker
predated the Spark REST APIs for status tracking which is why there's some
overlap of functionality between those APIs. Given that SparkStatusTracker
is a longstanding stable public API I'd prefer to not remove it because
there may be a lot of existing user code that depends on it. It's also a
relatively easy-to-support API because it presents a clean query
abstraction and doesn't expose mutable data structures via its public
interface, so we should be able to support this interface with an
implementation based on the new UI database.

JobProgressListener, on the other hand, has a messy interface which was not
designed for use by code outside of Spark. This interface was marked as
@DeveloperAPI as part of Spark 1.0 (see
https://github.com/apache/spark/pull/648) but I think that decision was a
mistake because the interface exposes mutable internal state. For example,
if a user wanted to query completed stages using JobProgressListener then
they would access a field declared as

  val completedStages = ListBuffer[StageInfo]()

which requires the user to explicitly synchronize on the
JobProgressListener instance in order to safely access this field. This is
a bad API and it's not really possible to cleanly present this same
interface with a database-backed implementation. In addition, this
interface has not been fully stable over time and there's currently no
public / DeveloperAPI mechanism to get access to the Spark-constructed
instance of JobProgressListener.

Given all of this this, I think that it's unlikely that users are relying
on JobProgressListener since Spark has other APIs for status tracking which
are more stable and easier to work with. If anyone is relying on this then
they could inline the JobProgressListener source in their own project and
instantiate and register the listener themselves.

Thus I think it's fine to remove JobProgressListener but think we should
keep SparkStatusTracker. I think that the decision of whether we want to
make a next-generation "V2" programmatic status API based on the REST API
types can happen later / independently.

On Thu, Mar 23, 2017 at 1:32 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> Hello all,
>
> For those not following, I'm working on SPARK-18085, where my goal is
> to decouple the storage of UI data from the actual UI implementation.
> This is mostly targeted at the history server, so that it's possible
> to quickly load a "database" with UI information instead of the
> existing way of re-parsing event logs, but I think it also helps with
> the live UI, since it doesn't require storing UI information in memory
> and thus relieves some memory pressure on the driver. (I may still add
> an in-memory database in that project, but that's digressing from the
> topic at hand.)
>
> One of my (unwritten) goals in that project was to get rid of
> JobProgressListener. Now that I'm at a point where I can do that from
> the UI's p.o.v., I ran into SparkStatusTracker. So I'd like to get
> people's views on two topics.
>
> (i) deprecate SparkStatusTracker, provide a new API based on the
> public REST types.
>
> SparkStatusTracker provides yet another way of getting job, stage and
> executor information (aside from the UI and the API). It has its own
> types that model those, which are based on the existing UI types but
> not the same. It could be replaced by making REST calls to the UI
> endpoint, but that's sub-optimal since it doesn't make a lot of sense
> to do that when you already have an instance of SparkContext to play
> with.
>
> Since that's a public, stable API, it can't be removed right away. But
> I'd like to propose that we deprecate it, and provide a new API that
> is based on the REST types (which, with my work, are also used in the
> UI). The existing "SparkStatusTracker" would still exist until we can
> remove it, of course.
>
> What do people think about this approach? Another option is to not add
> the new API, but keep SparkStatusTracker around using the new UI
> database to back it.
>
> (ii) Remove JobProgressListener
>
> I didn't notice it before, but JobProgressListener is public-ish
> (@DeveloperApi). I'm not sure why that is, and it's a weird thing
> because it exposes non-public types (from UIData.scala) in its API.
> With the work I'm doing, and the above suggestion about
> SparkStatusTracker, JobProgressListener becomes unused in Spark
> itself, and keeping it would just mean the driver keeps using unneeded
> memory.
>
> Are there concerns about removing that class? Its functionality is
> available in both SparkStatusTracker and the REST API, so it's mostly
> redundant.
>
>
> So, thoughts?
>
>
> Note to self: (i) above means I'd have to scale back some of my goals
> for SPARK-18085. More specifically, the code that creates the UI
> database will always need to run (just like JobProgressListener always
> exists now), so that SparkStatusTracker still works. Which also means
> moving some of the code I was hoping to keep in a separate module into
> core/.
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

Reply via email to