Tim we have a pattern for this called Reconciler that Gaurav has also
mentioned. There are some examples for it in Malhar

On Thu, Dec 17, 2015 at 9:47 AM, Timothy Farkas <[email protected]> wrote:

> Hi All,
>
> One of our users is outputting to Cassandra, but they want to handle a
> Cassandra failure or Cassandra down time gracefully from an output
> operator. Currently a lot of our database operators will just fail and
> redeploy continually until the database comes back. This is a bad idea for
> a couple of reasons:
>
> 1 - We rely on buffer server spooling to prevent data loss. If the database
> is down for a long time (several hours or a day) we may run out of space to
> spool for buffer server since it spools to local disk, and data is purged
> only after a window is committed. Furthermore this buffer server problem
> will exist for all the Streaming Containers in the dag, not just the one
> immediately upstream from the output operator, since data is spooled to
> disk for all operators and only removed for windows once a window is
> committed.
>
> 2 - If there is another failure further upstream in the dag, upstream
> operators will be redeployed to a checkpoint less than or equal to the
> checkpoint of the database operator in the At leas once case. This could
> mean redoing several hours or a day worth of computation.
>
> We should support a mechanism to detect when the connection to a database
> is lost and then spool to hdfs using a WAL, and then write the contents of
> the WAL into the database once it comes back online. This will save the
> local disk space of all the nodes used in the dag and allow it to be used
> for only the data being output to the output operator.
>
> Ticket here if anyone is interested in working on it:
>
> https://malhar.atlassian.net/browse/MLHR-1951
>
> Thanks,
> Tim
>

Reply via email to