Hi Thomas,

If I understand correctly, we want to provide the operator with a feature similar to the beginWindow and endWindow callbacks (startBatch & endBatch, as Sandeep called them), but under the control of user code. So user code will define when startBatch & endBatch get called.
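To make sure we are talking about the same thing, here is a rough sketch of what I picture. Everything in it is hypothetical and not existing Apex API: the BatchAware interface, the startBatch/endBatch signatures, and the "file" context key are my assumptions, and runBatch merely stands in for the engine delivering control tuples to a downstream operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BatchSketch {

    /** Hypothetical optional interface for batch-aware operators (not in Apex today). */
    public interface BatchAware {
        void startBatch(Map<String, String> context);
        void endBatch(Map<String, String> context);
    }

    /** Hypothetical downstream operator that records every callback it receives. */
    public static class RecordingOperator implements BatchAware {
        public final List<String> events = new ArrayList<>();

        @Override
        public void startBatch(Map<String, String> context) {
            events.add("startBatch " + context.get("file"));
        }

        public void process(String tuple) {
            events.add("process " + tuple);
        }

        @Override
        public void endBatch(Map<String, String> context) {
            events.add("endBatch " + context.get("file"));
        }
    }

    /**
     * Stand-in for the upstream operator plus engine: user code decides where
     * the batch begins and ends (here, one batch per input file) and passes
     * contextual information along with the boundary.
     */
    public static void runBatch(String file, List<String> records, RecordingOperator downstream) {
        Map<String, String> context = Map.of("file", file);
        downstream.startBatch(context);
        for (String record : records) {
            downstream.process(record);
        }
        downstream.endBatch(context);
    }

    public static void main(String[] args) {
        RecordingOperator operator = new RecordingOperator();
        runBatch("a.txt", List.of("r1", "r2"), operator);
        operator.events.forEach(System.out::println);
    }
}
```

The point of the sketch is only the ordering: the boundary is chosen by user code rather than by the streaming window count, and the downstream operator sees startBatch, then the data, then endBatch.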
Is my understanding correct?

Thanks,
Chinmay.

On Mon, Dec 28, 2015 at 8:31 PM, Sandeep Deshmukh <[email protected]> wrote:

> +1 for batch support in Apex. I would be interested to be part of this
> work.
>
> I would like to start with basics and would like to know how one will
> define "batch" in the Apex context. Which of the following cases would be
> supported under batch:
>
> 1. A program completes a task and shuts itself down automatically once
>    the task is complete. E.g. the program needs to copy a set of files
>    from source to destination.
> 2. A program completes a task and then waits for a pre-defined time to
>    poll for more work. E.g. the program copies all the files from the
>    source location and then periodically checks, say every 1 hour, if
>    there are new files at the source and copies them.
> 3. A program completes a task and then polls every 1 hr as in case 2 but
>    releases resources during wait time.
>
> Needs for each of the above will vary. I am putting down some basic
> requirements for each of them.
>
> 1. This case will need a mechanism to shut down automatically on
>    completion of the task.
>
>    StartProgram()
>    StartBatch()
>    Streaming Application starts, runs and finishes
>    EndBatch()
>    EndProgram()
>
> 2. This will simply need a construct to wait for some time (say 10
>    minutes) or till some time (till 1pm).
>
>    StartProgram()
>    while(true)
>    {
>      StartBatch()
>      Streaming Application starts, runs and finishes
>      EndBatch()
>      WaitTill(time) or WaitFor(timeperiod)
>    }
>    EndProgram()
>
> 3. Apart from the wait construct, we also need support for releasing
>    resources.
>
>    StartProgram()
>    while(true)
>    {
>      RestartFromSavedState() // if any state is saved previously.
>      StartBatch()
>      Streaming Application starts, runs and finishes
>      EndBatch()
>      SaveState()
>      ReleaseResources()
>      WaitTill(time) or WaitFor(timeperiod)
>    }
>    EndProgram()
>
> All the constructs: waitTime(), RestartFromSavedState(), SaveState(),
> ReleaseResources() could very well be part of StartBatch() or EndBatch().
> I have put them separately for clear understanding only.
>
> Another point to think about would be the scheduler. A batch job is
> generally triggered as a cron job. Do we still see Apex jobs being
> triggered by cron, or would we like to include a scheduler within Apex
> that will trigger jobs based on time, on some external trigger, or even by
> polling for events?
>
> Regards
> Sandeep
>
> On Mon, Dec 28, 2015 at 5:11 PM, Bhupesh Chawda <[email protected]>
> wrote:
>
> > +1
> >
> > I think in the batch case, application windows may be transparent to the
> > user application / operator logic. A batch can be thought of as one
> > instantiation of an Apex DAG, from setup() to teardown() for all
> > operators. Maybe we need to define a higher level API which encapsulates
> > a streaming application.
> > Something like:
> >
> > StartBatch()
> > Streaming Application starts, runs and finishes
> > EndBatch()
> >
> > The streaming application will run transparently with all the windowing /
> > checkpointing logic that it currently does. Checkpointing large amounts
> > of data may be avoided by either checkpointing at large intervals or even
> > disabling checkpointing for the batch job.
> > Additionally, the external trigger (existence of some file etc.) can be
> > controlled by the StartBatch() and EndBatch() calls. In all the batch use
> > cases, it is usually the case that once the input is processed
> > completely, the batch is done. Example: in MapReduce, all splits
> > processed means the batch job is done. Similar primitives can be
> > supported by Apex in order to facilitate the control management in the
> > StartBatch() and EndBatch() methods.
> >
> > -Bhupesh
> >
> > On Mon, Dec 28, 2015 at 1:34 PM, Thomas Weise <[email protected]>
> > wrote:
> >
> > > The following JIRA is open to enhance the support for batch:
> > >
> > > https://issues.apache.org/jira/browse/APEXCORE-235
> > >
> > > One of the challenges with batch on Apex today is that there isn't any
> > > native support to identify begin/end of batch and associate actions
> > > with it. For example, at the beginning we may want to fetch some data
> > > needed for all subsequent processing and at the end perform some
> > > finalization action or push to an external system (add partition to
> > > Hive table or similar).
> > >
> > > Absent native support, the workaround is to add a bunch of ports and
> > > extra operators for propagation and synchronization purposes, which
> > > makes building the batch application with standard operators or
> > > development of custom operators rather difficult and inefficient.
> > >
> > > The span of a batch can also be seen as a user defined window, with
> > > logic for begin and end. The current "application window" support is
> > > limited to a multiple of the streaming window on a per operator basis.
> > > In the batch case, the boundary needs to be more flexible - user code
> > > needs to be able to determine begin/endWindow based on external data
> > > (existence of files etc.).
> > >
> > > There is another commonality with application window, and that's
> > > alignment of checkpointing. For batches where it is more efficient to
> > > redo the processing instead of checkpointing potentially large amounts
> > > of intermediate state for incremental recovery, it would be nice to be
> > > able to say "user window == checkpoint interval".
> > >
> > > This is to float the idea of having a window control that can be
> > > influenced by user code.
> > > An operator that identifies the batch boundary tells the
> > > engine about it, and corresponding control tuples are submitted through
> > > the stream, leading to callbacks on downstream operators. These control
> > > tuples should be able to carry contextual information that can be used
> > > in downstream operator logic (file names, schema information etc.).
> > >
> > > I don't expect the current beginWindow/endWindow can be augmented in a
> > > backward compatible way to accommodate this, but a similar optional
> > > interface could be supported to enable batch aware operators and
> > > checkpointing optimization.
> > >
> > > Thoughts?
