Re: Enhance batch support - batch demarcation

Chinmay Kolhatkar Mon, 28 Dec 2015 21:55:42 -0800

Hi Thomas,

If I understand correctly, we want to give operators callbacks similar to
beginWindow and endWindow (startBatch & endBatch, as Sandeep called them),
but triggered by user code rather than by the engine.
So user code will define when startBatch & endBatch get called.
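
To make the question concrete, here is a rough sketch in plain Java of what I
have in mind. Everything in it is an assumption for illustration: BatchAware,
CountingOperator and BatchDemo are made-up names, not existing Apex API, and
in the real engine the callbacks would presumably be driven by control tuples
travelling through the stream rather than by direct method calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical optional interface, analogous to beginWindow/endWindow.
// Not part of the Apex API; the names are illustrative only.
interface BatchAware {
    // Called when an upstream operator declares the start of a batch.
    // The context stands in for data carried by a control tuple.
    void startBatch(Map<String, String> context);

    // Called when the upstream operator declares the end of the batch.
    void endBatch();
}

// Toy downstream operator that counts tuples per batch and logs boundaries.
class CountingOperator implements BatchAware {
    final List<String> log = new ArrayList<>();
    private int tuplesInBatch;

    @Override
    public void startBatch(Map<String, String> context) {
        tuplesInBatch = 0;
        log.add("start " + context.get("file"));
    }

    void process(String tuple) {
        tuplesInBatch++;
    }

    @Override
    public void endBatch() {
        log.add("end count=" + tuplesInBatch);
    }
}

// Stand-in for the upstream operator: user code decides the batch boundary.
// Here the (assumed) rule is simply "one input file == one batch".
class BatchDemo {
    static List<String> run(Map<String, List<String>> files) {
        CountingOperator downstream = new CountingOperator();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            downstream.startBatch(Collections.singletonMap("file", file.getKey()));
            for (String line : file.getValue()) {
                downstream.process(line);
            }
            downstream.endBatch();
        }
        return downstream.log;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", Arrays.asList("x", "y"));
        files.put("b.txt", Arrays.asList("z"));
        System.out.println(run(files));
    }
}
```

The point of the sketch is only that the boundary is chosen by operator logic
(one file per batch here) and not by a fixed multiple of the streaming window.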


Is my understanding correct?

Thanks,
Chinmay.


On Mon, Dec 28, 2015 at 8:31 PM, Sandeep Deshmukh <[email protected]>
wrote:

> +1 for batch support in Apex. I would be interested to be part of this
> work.
>
> I would like to start with the basics: how will one define "batch" in the
> Apex context? Which of the following cases would be supported under batch:
>
>    1. A program completes a task and shuts itself down automatically once
>    the task is complete. E.g. the program needs to copy a set of files
>    from source to destination.
>    2. A program completes a task and then waits for a pre-defined time to
>    poll for more work. E.g. the program copies all the files from the
>    source location and then periodically checks, say every 1 hour, if
>    there are new files at the source and copies them.
>    3. A program completes a task and then polls every 1 hour as in case 2,
>    but releases resources during the wait time.
>
> The needs for each of the above will vary. I am putting down some basic
> requirements for each of them:
>
> 1. This case will need a mechanism to shut down automatically on completion
> of the task.
>
> StartProgram()
>     StartBatch()
>         Streaming Application starts, runs and finishes
>     EndBatch()
> EndProgram()
>
> 2. This will simply need a construct to wait for some time (say 10
> minutes) or until some time (say 1pm).
>
> StartProgram()
> while(true)
> {
>     StartBatch()
>         Streaming Application starts, runs and finishes
>     EndBatch()
>     WaitTill(time) or WaitFor(timeperiod)
> }
> EndProgram()
>
> 3. Apart from the wait construct, we also need support for releasing
> resources.
>
> StartProgram()
> while(true)
> {
>     RestartFromSavedState() // if any state is saved previously.
>     StartBatch()
>         Streaming Application starts, runs and finishes
>     EndBatch()
>     SaveState()
>     ReleaseResources()
>     WaitTill(time) or WaitFor(timeperiod)
> }
> EndProgram()
>
>
> All the constructs (WaitTill()/WaitFor(), RestartFromSavedState(),
> SaveState(), ReleaseResources()) could very well be part of StartBatch() or
> EndBatch(). I have listed them separately only for clarity.
>
> Another point to think about is the scheduler. A batch job is generally
> triggered as a cron job. Do we still see Apex jobs being triggered by cron,
> or would we like to include a scheduler within Apex that triggers jobs
> based on time, on some external trigger, or even by polling for events?
>
> Regards
> Sandeep
>
> On Mon, Dec 28, 2015 at 5:11 PM, Bhupesh Chawda <[email protected]>
> wrote:
>
> > +1
> >
> > I think in the batch case, application windows may be transparent to the
> > user application / operator logic. A batch can be thought of as one
> > instantiation of an Apex DAG, from setup() to teardown() for all
> > operators. Maybe we need to define a higher-level API which encapsulates
> > a streaming application.
> > Something like:
> >
> > StartBatch()
> >   Streaming Application starts, runs and finishes
> > EndBatch()
> >
> > The streaming application will run transparently with all the windowing /
> > checkpointing logic that it currently has. Checkpointing large amounts of
> > data may be avoided by either checkpointing at large intervals or even
> > disabling checkpointing for the batch job.
> > Additionally, the external trigger (existence of some file etc.) can be
> > controlled by the StartBatch() and EndBatch() calls. In batch use cases,
> > the batch is usually done once the input is processed completely.
> > Example: in MapReduce, all splits processed means the batch job is done.
> > Similar primitives can be supported by Apex to facilitate control
> > management in the StartBatch() and EndBatch() methods.
> >
> > -Bhupesh
> >
> > On Mon, Dec 28, 2015 at 1:34 PM, Thomas Weise <[email protected]>
> > wrote:
> >
> > > The following JIRA is open to enhance the support for batch:
> > >
> > > https://issues.apache.org/jira/browse/APEXCORE-235
> > >
> > > One of the challenges with batch on Apex today is that there isn't any
> > > native support to identify the begin/end of a batch and associate
> > > actions with it. For example, at the beginning we may want to fetch
> > > some data needed for all subsequent processing, and at the end perform
> > > some finalization action or push to an external system (add a
> > > partition to a Hive table or similar).
> > >
> > > Absent native support, the workaround is to add a bunch of ports and
> > > extra operators for propagation and synchronization purposes, which
> > > makes building the batch application with standard operators, or
> > > developing custom operators, rather difficult and inefficient.
> > >
> > > The span of a batch can also be seen as a user-defined window, with
> > > logic for begin and end. The current "application window" support is
> > > limited to a multiple of the streaming window on a per-operator basis.
> > > In the batch case, the boundary needs to be more flexible: user code
> > > needs to be able to determine begin/endWindow based on external data
> > > (existence of files etc.).
> > >
> > > There is another commonality with the application window, and that's
> > > alignment of checkpointing. For batches where it is more efficient to
> > > redo the processing instead of checkpointing potentially large amounts
> > > of intermediate state for incremental recovery, it would be nice to be
> > > able to say "user window == checkpoint interval".
> > >
> > > This is to float the idea of having a window control that can be
> > > influenced by user code. An operator that identifies the batch boundary
> > > tells the engine about it, and corresponding control tuples are
> > > submitted through the stream, leading to callbacks on downstream
> > > operators. These control tuples should be able to carry contextual
> > > information that can be used in downstream operator logic (file names,
> > > schema information etc.).
> > >
> > > I don't expect the current beginWindow/endWindow can be augmented in a
> > > backward compatible way to accommodate this, but a similar optional
> > > interface could be supported to enable batch aware operators and
> > > checkpointing optimization.
> > >
> > > Thoughts?
> > >
> >
>
