Makes sense, Steven. Thanks. I see what you are saying about the complication and overhead.
On Mon, Mar 28, 2016 at 10:08 PM, Steven Phillips <[email protected]> wrote:

> If a fragment has already begun execution and sent some data to downstream
> fragments, there is no way to simply restart the failed fragment, because
> we would also have to restart any downstream fragments that consumed that
> output, and so on up the tree, as well as restart any leaf fragments that
> fed into any of those fragments. This is because we don't store
> intermediate results to disk.
>
> The case where I think it would even be possible would be if a node died
> before sending any data downstream. But I think the only way to be sure of
> this would be to poll all of the downstream fragments and verify that no
> data from the failed fragment was ever received. I think this would add a
> lot of complication and overhead to Drill.
>
> On Sat, Mar 26, 2016 at 10:03 AM, John Omernik <[email protected]> wrote:
>
> > Thanks for the responses. So, even if the drillbit that died wasn't the
> > foreman, the query would fail? Interesting. Is there any mechanism for
> > reassigning fragments, to *try harder*, so to speak? Does this play out
> > too if I have a query and something on that node caused a fragment to
> > fail: could it be tried somewhere else? I am not trying to recreate
> > MapReduce in Drill (although I am sort of asking about similar
> > features), but in a distributed environment, what is the cost of
> > allowing the foreman to time out a fragment and try again elsewhere?
> > Say there was a heartbeat sent back from the bits running a fragment,
> > and if the heartbeat and lack of results exceeded 10 seconds, have the
> > foreman try again somewhere else (up to X times, configured by a
> > setting). I am just curious here, for my own knowledge, what makes that
> > hard in a system like Drill.
> > On Sat, Mar 26, 2016 at 10:47 AM, Abdel Hakim Deneche <[email protected]> wrote:
> >
> > > The only way the query could succeed is if all fragments that were
> > > running on the now-dead node had already finished. Other than that,
> > > the query fails.
> > >
> > > On Sat, Mar 26, 2016 at 4:45 PM, Neeraja Rentachintala <[email protected]> wrote:
> > >
> > > > As far as I know, there is no failure handling in Drill. The query
> > > > dies.
> > > >
> > > > On Sat, Mar 26, 2016 at 7:52 AM, John Omernik <[email protected]> wrote:
> > > >
> > > > > With distributed Drill, what is the expected/desired bit failure
> > > > > behavior? I.e., if you are running and certain fragments end up on
> > > > > a node with a bit in a flaky state (or a bit that suddenly dies),
> > > > > what is the desired and actual behavior of the query? I am
> > > > > guessing that if the bit was the foreman, the query dies; I guess
> > > > > that's unavoidable. But if it's just a worker, does the foreman
> > > > > detect this and reschedule the fragment, or does the query die
> > > > > anyway?
> > > > >
> > > > > John
> >
> > --
> > Abdelhakim Deneche
> > Software Engineer
> > <http://www.mapr.com/>
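Steven's point about cascading restarts can be made concrete with a toy model. This is not Drill code and all names are invented; it just sketches why, in a pipelined engine with no persisted intermediate results, restarting one fragment drags its transitive consumers (and every producer feeding those consumers) into the restart set:

```python
# Toy model of pipelined fragment execution (NOT Drill internals; the
# fragment names and the restart_set() helper are invented for
# illustration). Edges point from producer fragment to consumer
# fragment. Because intermediate results are never written to disk,
# restarting a fragment means restarting every consumer that already
# saw part of its output, which in turn means replaying every producer
# that fed those consumers, and so on.

from collections import defaultdict


def restart_set(edges, failed):
    """Return the fragments that must restart when `failed` dies
    mid-stream, assuming no intermediate results are stored."""
    downstream = defaultdict(set)  # producer -> consumers
    upstream = defaultdict(set)    # consumer -> producers
    for producer, consumer in edges:
        downstream[producer].add(consumer)
        upstream[consumer].add(producer)

    must_restart = {failed}
    frontier = [failed]
    while frontier:
        frag = frontier.pop()
        # Consumers already received partial output, so they restart too;
        # a restarted consumer then needs *all* of its inputs replayed.
        for nxt in downstream[frag] | upstream[frag]:
            if nxt not in must_restart:
                must_restart.add(nxt)
                frontier.append(nxt)
    return must_restart


# A simple plan: two scans feed a join, which feeds an aggregation.
plan = [("scan_a", "join"), ("scan_b", "join"), ("join", "agg")]

# Losing just scan_a drags the entire plan into the restart set.
print(sorted(restart_set(plan, "scan_a")))
# -> ['agg', 'join', 'scan_a', 'scan_b']
```

In a typical plan the restart set ends up being the whole query, which is why simply failing the query (and leaving retry to the client) is the cheaper design, as Steven notes.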
