Hi:
The documentation for Sink.addBatch is as follows:
/**
 * Adds a batch of data to this sink. The data for a given `batchId` is
 * deterministic and if this method is called more than once with the same
 * batchId (which will happen in the case of failures), then `data` should
 * only be added once.
 *
 * Note 1: You cannot apply any operators on `data` except consuming it
 * (e.g., `collect/foreach`). Otherwise, you may get a wrong result.
 *
 * Note 2: The method is supposed to be executed synchronously, i.e. the
 * method should only return after data is consumed by sink successfully.
 */
def addBatch(batchId: Long, data: DataFrame): Unit
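
For reference, here is a minimal sketch of how I read that contract, written
against the Spark 2.x internal Sink trait; the batch-id tracking for
idempotency is my own assumption, not something Spark provides:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class LoggingSink extends Sink {
  // Last batch fully consumed; used to make retried batches no-ops.
  @volatile private var lastCommittedBatchId: Long = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // The same batchId can be re-delivered after a failure, so skip
    // anything already consumed (the "added once" requirement).
    if (batchId <= lastCommittedBatchId) return

    // Per Note 1: only consume `data` (e.g. collect/foreach), without
    // applying further operators. Here each row is simply printed on
    // the executors.
    data.foreach(row => println(row))

    lastCommittedBatchId = batchId
  }
}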
A few questions about the DataFrame passed as the argument to addBatch:

1. Is it all the data in a partition for that trigger, or all the data in
   that trigger?
2. Is there a way to control the size of the data in each addBatch
   invocation, to make sure we don't run into an OOM exception on the
   executor while calling collect? (A sketch of the pattern I mean
   follows below.)
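
To make the second question concrete, this is the kind of consumption
pattern I mean (CollectingSink and writeRow are hypothetical stand-ins
for our actual sink):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.execution.streaming.Sink

class CollectingSink(writeRow: Row => Unit) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // collect() materializes the entire micro-batch for this trigger,
    // so a single large trigger is enough to run out of memory here.
    data.collect().foreach(writeRow)
  }
}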
Thanks