Makes sense, Steven. Thanks. I see what you are saying about the complication and overhead.
On Mon, Mar 28, 2016 at 10:08 PM, Steven Phillips <[email protected]> wrote:

> If a fragment has already begun execution and sent some data to downstream
> fragments, there is no way to simply restart the failed fragment, because
> we would also have to restart any downstream fragments that consumed that
> output, and so on up the tree, as well as restart any leaf fragments that
> fed into any of those fragments. This is because we don't store
> intermediate results to disk.
>
> The case where I think it would even be possible would be if a node died
> before sending any data downstream. But I think the only way to be sure of
> this would be to poll all of the downstream fragments and verify that no
> data from the failed fragment was ever received. I think this would add a
> lot of complication and overhead to Drill.
>
> On Sat, Mar 26, 2016 at 10:03 AM, John Omernik <[email protected]> wrote:
>
> > Thanks for the responses. So, even if the drillbit that died wasn't the
> > foreman, the query would fail? Interesting. Is there any mechanism for
> > reassigning fragments, to *try harder*, so to speak? Does this play out
> > too if I have a query and something on that node caused a fragment to
> > fail: could it be tried somewhere else? I am not trying to recreate
> > MapReduce in Drill (although I am sort of asking about similar
> > features), but in a distributed environment, what is the cost of
> > allowing the foreman to time out a fragment and try again elsewhere?
> > Say there was a heartbeat sent back from the bits running a fragment,
> > and if the heartbeat and lack of results exceeded 10 seconds, have the
> > foreman try again somewhere else (up to X times, configured by a
> > setting). I am just curious here, for my own knowledge, what makes that
> > hard in a system like Drill.
> > On Sat, Mar 26, 2016 at 10:47 AM, Abdel Hakim Deneche <[email protected]> wrote:
> >
> > > The only way the query could succeed is if all fragments that were
> > > running on the now-dead node had already finished. Other than that,
> > > the query fails.
> > >
> > > On Sat, Mar 26, 2016 at 4:45 PM, Neeraja Rentachintala <[email protected]> wrote:
> > >
> > > > As far as I know, there is no failure handling in Drill. The query
> > > > dies.
> > > >
> > > > On Sat, Mar 26, 2016 at 7:52 AM, John Omernik <[email protected]> wrote:
> > > >
> > > > > With distributed Drill, what is the expected/desired bit failure
> > > > > behavior? I.e., if you are running and certain fragments end up on
> > > > > a node with a bit in a flaky state (or a bit that suddenly dies),
> > > > > what is the desired and actual behavior of the query? I am
> > > > > guessing that if the bit was the foreman, the query dies; I guess
> > > > > that's unavoidable. But if it's just a worker, does the foreman
> > > > > detect this and reschedule the fragment, or does the query die
> > > > > anyway?
> > > > >
> > > > > John
> >
> > --
> > Abdelhakim Deneche
> > Software Engineer
> > <http://www.mapr.com/>
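Steven's point about cascading restarts can be made concrete with a toy model. This is not Drill code and all names are invented; it just sketches why, in a pipelined engine with no persisted intermediate results, restarting one fragment drags its transitive consumers (and every producer feeding those consumers) into the restart set:

```python
# Toy model of pipelined fragment execution (NOT Drill internals; the
# fragment names and the restart_set() helper are invented for
# illustration). Edges point from producer fragment to consumer
# fragment. Because intermediate results are never written to disk,
# restarting a fragment means restarting every consumer that already
# saw part of its output, which in turn means replaying every producer
# that fed those consumers, and so on.

from collections import defaultdict


def restart_set(edges, failed):
    """Return the fragments that must restart when `failed` dies
    mid-stream, assuming no intermediate results are stored."""
    downstream = defaultdict(set)  # producer -> consumers
    upstream = defaultdict(set)    # consumer -> producers
    for producer, consumer in edges:
        downstream[producer].add(consumer)
        upstream[consumer].add(producer)

    must_restart = {failed}
    frontier = [failed]
    while frontier:
        frag = frontier.pop()
        # Consumers already received partial output, so they restart too;
        # a restarted consumer then needs *all* of its inputs replayed.
        for nxt in downstream[frag] | upstream[frag]:
            if nxt not in must_restart:
                must_restart.add(nxt)
                frontier.append(nxt)
    return must_restart


# A simple plan: two scans feed a join, which feeds an aggregation.
plan = [("scan_a", "join"), ("scan_b", "join"), ("join", "agg")]

# Losing just scan_a drags the entire plan into the restart set.
print(sorted(restart_set(plan, "scan_a")))
# -> ['agg', 'join', 'scan_a', 'scan_b']
```

In a typical plan the restart set ends up being the whole query, which is why simply failing the query (and leaving retry to the client) is the cheaper design, as Steven notes.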
