There is support for assuming writes are asynchronous. I don't know if it's ever been tried in production though. It also doesn't quite solve the problem.

Right now, collectors only return on success; on a write failure, they exit. We would need to decide whether a write failure means a 200 OK with a special response body, or a different HTTP response code. Likewise, agents would need to be modified to handle that response. And we have a bit of a design question: if a collector is up but can't write, should the agent fail over to a new one, or just retry later?

--Ari

On Tue, Oct 12, 2010 at 10:25 AM, Bill Graham <[email protected]> wrote:
> We had problems with a single DN the other day and all collectors
> ultimately died after trying N failed attempts. I believe at least one
> of the failures was during a commit.
>
> I think backpressure sounds like the right approach, but it seems like
> there would be some practical challenges, particularly around async
> writes or commits to HDFS. Ari, how does this behavior work currently,
> and would this be difficult to handle?
>
> In oahc.datacollection.writer.SeqFileWriter there's an exception block
> with this in it:
>
>     // We don't want to loose anything
>     log.fatal("IOException when trying to write a chunk, Collector is going to exit!", e);
>     DaemonWatcher.bailout(-1);
>     isRunning = false;
>
> On Tue, Oct 12, 2010 at 9:57 AM, Eric Yang <[email protected]> wrote:
>> I thought that is what it is currently doing with one twist, the commit and
>> response is async. Collector exits if the file system is unavailable for
>> extensive period of time. If it is not doing what's described above, then
>> we definitely should fix it.
>>
>> Regards,
>> Eric
>>
>> On 10/11/10 10:49 PM, "Ariel Rabkin" <[email protected]> wrote:
>>
>> Howdy.
>>
>> This is an answer to a question Bill asked me recently: can we
>> redesign the Collector process to behave better if the filesystem is
>> unavailable?
>>
>> I think we can do this by backpressure. If the write fails, the
>> collector should return an error to the agent. And the agent should
>> treat it like a post failure, and retry. Thoughts?
>>
>> --Ar
>>
>> --
>> Ari Rabkin [email protected]
>> UC Berkeley Computer Science Department

--
Ari Rabkin [email protected]
UC Berkeley Computer Science Department
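
To make the open design question in this thread concrete, here is a minimal Java sketch of one possible agent-side policy. It is not existing Chukwa code. It assumes the collector signals a failed write with a distinct status code (503 is picked purely for illustration; the thread leaves open whether it would be a code or a 200 with a special body), and the names `BackpressureSketch`, `Action`, `decide`, and `maxRetriesBeforeFailover` are all hypothetical. The policy retries the same collector a few times, then fails over, which is one middle ground between the two options raised above.

```java
public class BackpressureSketch {

    /** What the agent does after a single post attempt (hypothetical names). */
    public enum Action { ADVANCE, RETRY_LATER, FAIL_OVER }

    // Assumption: the collector reports "up, but the write failed" with 503.
    // The thread has not settled on a status code or a special 200 body.
    public static final int WRITE_FAILED = 503;

    /**
     * Decide the agent's next step.
     *
     * @param httpStatus               status returned by the collector
     * @param consecutiveFailures      write failures already seen from this collector
     * @param maxRetriesBeforeFailover retries allowed before trying another collector
     */
    public static Action decide(int httpStatus, int consecutiveFailures,
                                int maxRetriesBeforeFailover) {
        if (httpStatus == 200) {
            // Data was written; the agent can advance its checkpoint.
            return Action.ADVANCE;
        }
        if (httpStatus == WRITE_FAILED
                && consecutiveFailures < maxRetriesBeforeFailover) {
            // Collector is alive but cannot write: back off and resend the same data.
            return Action.RETRY_LATER;
        }
        // Any other error, or too many write failures: try a different collector.
        return Action.FAIL_OVER;
    }

    public static void main(String[] args) {
        int[] responses = {200, 503, 503, 200, 503, 503, 503, 503};
        int failures = 0;
        for (int status : responses) {
            Action action = decide(status, failures, 3);
            System.out.println(status + " -> " + action);
            // Count only consecutive retries; success or failover resets the count.
            failures = (action == Action.RETRY_LATER) ? failures + 1 : 0;
        }
    }
}
```

With a policy like this the collector would stay up on a write failure instead of calling `DaemonWatcher.bailout(-1)`, and agents would hold their data until some collector accepts it. Whether a bounded retry count is the right trigger for failover is exactly the question still open in the thread.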
