Re: a more fault-tolerant collector

Ariel Rabkin Tue, 12 Oct 2010 10:41:22 -0700

There is support for assuming writes are asynchronous. I don't know if
it's ever been tried in production though.  It also doesn't quite
solve the problem.


Right now, collectors only return on success; on a write failure, they exit.
We would need to decide whether a write failure is reported as a 200 OK
with a special response body, or as a different HTTP response code.
Likewise, agents would need to be modified to handle that response.
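
To make the second option concrete, here is a minimal sketch in which a
write failure is reported with a distinct status code rather than by
exiting. Everything in it is an assumption for illustration: the class
CollectorResponseSketch, the ChunkWriter interface, and the choice of
503 are hypothetical and are not taken from the Chukwa code.

```java
import java.io.IOException;

// Hypothetical sketch; none of these names come from Chukwa itself.
final class CollectorResponseSketch {

    /** Stand-in for whatever the collector uses to write a chunk. */
    interface ChunkWriter {
        void write(byte[] chunk) throws IOException;
    }

    static final int SC_OK = 200;
    // Assumed choice of failure code; the thread leaves this open.
    static final int SC_SERVICE_UNAVAILABLE = 503;

    /** Returns the HTTP status the collector would send for one post. */
    static int handlePost(ChunkWriter writer, byte[] chunk) {
        try {
            writer.write(chunk);
            return SC_OK;
        } catch (IOException e) {
            // Report the failure to the agent instead of exiting,
            // so the agent still holds the data and can resend it.
            return SC_SERVICE_UNAVAILABLE;
        }
    }
}
```

The alternative (200 OK with a special body) would keep the status code
unchanged and move the same success/failure signal into the response body.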

And we have a bit of a design question. If a collector is up, but
can't write, should the agent fail over to a new one, or just retry
later?
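
One possible middle ground, sketched below, is to retry the same
collector a bounded number of times and only then fail over to the next
one. This is an illustration of the trade-off, not a proposal for the
actual agent code: AgentFailoverSketch, Poster, and the
retriesPerCollector parameter are all hypothetical names.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from Chukwa itself.
final class AgentFailoverSketch {

    /** Stand-in for an HTTP post; returns the response status code. */
    interface Poster {
        int post(String collectorUrl);
    }

    /**
     * Tries each collector in order. Each collector gets one initial
     * attempt plus up to retriesPerCollector retries before the agent
     * moves on. Returns the collector that accepted the post, or null
     * if every collector refused.
     */
    static String sendWithFailover(List<String> collectors,
                                   Poster poster,
                                   int retriesPerCollector) {
        for (String collector : collectors) {
            for (int attempt = 0; attempt <= retriesPerCollector; attempt++) {
                if (poster.post(collector) == 200) {
                    return collector;
                }
                // A real agent would wait here before retrying.
            }
        }
        return null;
    }
}
```

Setting retriesPerCollector to 0 gives pure failover, and passing a
single-element list gives pure retry, so both answers to the question
are special cases of the same loop.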

--Ari

On Tue, Oct 12, 2010 at 10:25 AM, Bill Graham <[email protected]> wrote:
> We had problems with a single DN the other day and all collectors
> ultimately died after N failed attempts. I believe at least one
> of the failures was during a commit.
>
> I think backpressure sounds like the right approach, but it seems like
> there would be some practical challenges, particularly around async
> writes or commits to HDFS. Ari, how does this behavior work currently,
> and would this be difficult to handle?
>
> In oahc.datacollection.writer.SeqFileWriter there's an exception block
> with this in it:
>
> // We don't want to loose anything
> log.fatal("IOException when trying to write a chunk, Collector is
> going to exit!", e);
> DaemonWatcher.bailout(-1);
> isRunning = false;
>
>
>
> On Tue, Oct 12, 2010 at 9:57 AM, Eric Yang <[email protected]> wrote:
>> I thought that is what it is currently doing, with one twist: the commit and
>> response are async.  The collector exits if the file system is unavailable
>> for an extended period of time.  If it is not doing what's described above,
>> then we definitely should fix it.
>>
>> Regards,
>> Eric
>>
>>
>> On 10/11/10 10:49 PM, "Ariel Rabkin" <[email protected]> wrote:
>>
>> Howdy.
>>
>> This is an answer to a question Bill asked me recently: can we
>> redesign the Collector process to behave better if the filesystem is
>> unavailable?
>>
>> I think we can do this by backpressure. If the write fails, the
>> collector should return an error to the agent. And the agent should
>> treat it like a post failure, and retry.  Thoughts?
>>
>> --Ari
>>
>> --
>> Ari Rabkin [email protected]
>> UC Berkeley Computer Science Department
>>
>>
>



-- 
Ari Rabkin [email protected]
UC Berkeley Computer Science Department
