+1 I think in the batch case, application windows may be transparent to the user application / operator logic. A batch can be thought of as one instantiation of a Apex Dag, from setup() to teardown() for all operators. May be we need to define a higher level API which encapsulates a streaming application. Something like:
StartBatch() Streaming Application starts, runs and finishes EndBatch() The streaming application will run transparently with all the windowing / checkpointing logic that it currently does. Checkpointing large amounts of data may be avoided by either checkpointing at large intervals or even disabling checkpointing for the batch job. Additionally, the external trigger (existence of some file etc. ) can be controlled by the StartBatch() and EndBatch() calls. In all the batch use cases, it is usually the case that once the input is processed completely, the batch is done. Example: In map reduce all splits processed means batch job is done. Similar primitives can be supported by Apex in order to facilitate the control management in the StartBatch() and EndBatch() methods. -Bhupesh On Mon, Dec 28, 2015 at 1:34 PM, Thomas Weise <[email protected]> wrote: > Following JIRA is open to enhance the support for batch: > > https://issues.apache.org/jira/browse/APEXCORE-235 > > One of the challenges with batch on Apex today is that there isn't any > native support to identify begin/end of batch and associate actions to it. > For example, at the beginning we may want to fetch some data needed for all > subsequent processing and at the end perform some finalization action or > push to external system (add partition to Hive table or similar). > > Absent native support, the workaround is to add a bunch of ports and extra > operators for propagation and synchronization purposes, which makes > building the batch application with standard operators or development of > custom operators rather difficult and inefficient. > > The span of a batch can also be seen as a user defined window, with logic > for begin and end. The current "application window" support is limited to a > multiple of streaming window on a per operator basis. In the batch case, > the boundary needs to be more flexible - user code needs to be able to > determine begin/endWindow based on external data (existence of files etc.). > > There is another commonality with application window, and that's alignment > of checkpointing. For batches where it is more efficient to redo the > processing instead of checkpointing potentially large amounts of > intermediate state for incremental recovery, it would be nice to be able to > say "user window == checkpoint interval". > > This is to float the idea of having a window control that can be influenced > by user code. An operator that identifies the batch boundary tells the > engine about it and corresponding control tuples are submitted through the > stream, leading to callbacks on downstream operators. These control > tuples should > be able to carry contextual information that can be used in downstream > operator logic (file names, schema information etc.) > > I don't expect the current beginWindow/endWindow can be augmented in a > backward compatible way to accommodate this, but a similar optional > interface could be supported to enable batch aware operators and > checkpointing optimization. > > Thoughts? >
