Tim we have a pattern for this called Reconciler that Gaurav has also mentioned. There are some examples for it in Malhar
On Thu, Dec 17, 2015 at 9:47 AM, Timothy Farkas <[email protected]> wrote: > Hi All, > > One of our users is outputting to Cassandra, but they want to handle a > Cassandra failure or Cassandra down time gracefully from an output > operator. Currently a lot of our database operators will just fail and > redeploy continually until the database comes back. This is a bad idea for > a couple of reasons: > > 1 - We rely on buffer server spooling to prevent data loss. If the database > is down for a long time (several hours or a day) we may run out of space to > spool for buffer server since it spools to local disk, and data is purged > only after a window is committed. Furthermore this buffer server problem > will exist for all the Streaming Containers in the dag, not just the one > immediately upstream from the output operator, since data is spooled to > disk for all operators and only removed for windows once a window is > committed. > > 2 - If there is another failure further upstream in the dag, upstream > operators will be redeployed to a checkpoint less than or equal to the > checkpoint of the database operator in the At leas once case. This could > mean redoing several hours or a day worth of computation. > > We should support a mechanism to detect when the connection to a database > is lost and then spool to hdfs using a WAL, and then write the contents of > the WAL into the database once it comes back online. This will save the > local disk space of all the nodes used in the dag and allow it to be used > for only the data being output to the output operator. > > Ticket here if anyone is interested in working on it: > > https://malhar.atlassian.net/browse/MLHR-1951 > > Thanks, > Tim >
