Hakim,

Can you point me to where [3] happens?

Two questions:

+ Why is the root fragment blocked? If the user channel is closed, the
query is cancelled [1], which should cancel and interrupt all running
fragments. This interruption happens regardless of the fragment failure
you pointed out when the user channel is closed [2]. Unless there is a
blocking call when the failure is handled through the channel-closed
listener, I don't see why cancellation is not triggered.

+ Why does the Foreman wait forever? AFAIK failures are reported
immediately to the user. Is the root fragment not reported as FAILED to the
Foreman?
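
To make that expectation concrete, here is a minimal sketch (plain Java with
hypothetical names, not Drill code) showing that a thread parked in
BlockingQueue.take(), as the root fragment's receiver presumably is, is
released by Thread.interrupt(). So if cancellation were actually delivered to
the root fragment, the blocked receiver should wake:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: demonstrates JDK interrupt semantics, not Drill internals.
public class InterruptUnblocksTake {
    public static boolean interruptWakesBlockedReceiver() throws InterruptedException {
        SynchronousQueue<Object> incoming = new SynchronousQueue<>();
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        CountDownLatch done = new CountDownLatch(1);

        Thread receiver = new Thread(() -> {
            try {
                incoming.take();            // blocks: no sender will ever offer
            } catch (InterruptedException e) {
                sawInterrupt.set(true);     // cancellation path: take() aborts
            } finally {
                done.countDown();
            }
        });
        receiver.start();
        Thread.sleep(100);                  // let the receiver park in take()
        receiver.interrupt();               // what query cancellation should do
        done.await(5, TimeUnit.SECONDS);
        return sawInterrupt.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("interrupt woke receiver: " + interruptWakesBlockedReceiver());
    }
}
```

Note that even if the interrupt lands before take() parks, the pending
interrupt status still makes take() throw immediately, so the receiver cannot
stay blocked once it has been interrupted.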

Thank you,
Sudheesh

[1]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L179
[2]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java#L92
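
For reference, the hang scenario described in the quoted report below (a
FAILED fragment's receiver acking senders OK while dropping their batches) can
be modeled in a few lines. All names here are illustrative, not Drill's actual
classes:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the reported hang, not Drill's actual API.
public class FailedFragmentReceiver {
    private final BlockingQueue<String> batches = new ArrayBlockingQueue<>(16);
    private final AtomicBoolean failed = new AtomicBoolean(false);

    public void markFailed() { failed.set(true); }

    // Sender side: after the fragment is FAILED, the batch is released and
    // the sender is still acked OK, so it completes -- but nothing is enqueued.
    public boolean receiveBatch(String batch) {
        if (failed.get()) {
            return true;                  // ack OK, drop the batch
        }
        batches.offer(batch);
        return true;                      // ack OK, batch queued
    }

    // Root fragment side: waits for the next batch. With the drop above this
    // can wait forever; a timeout stands in for "forever" here (null == hang).
    public String nextBatch(long timeoutMillis) throws InterruptedException {
        return batches.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

In this model every sender is acked and finishes cleanly, which matches the
logs showing all fragments succeeded, yet the root's queue stays empty, so a
consumer waiting on it never wakes unless something interrupts it.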

On Thu, Apr 7, 2016 at 6:29 AM, John Omernik <j...@omernik.com> wrote:

> Abdel -
>
> I think I've seen this on a MapR cluster I run, especially on CTAS.  For
> me, I have not brought it up because the cluster I am running on has some
> serious personal issues (like being hardware that's nearly 7 years old; it's a
> test cluster) and given the "hard to reproduce" nature of the problem, I've
> been reluctant to create noise. Given what you've described, it seems very
> similar to CTAS hangs I've seen, but couldn't accurately reproduce.
>
> This didn't add much to your post, but I wanted to give you a +1 for
> outlining this potential problem.  Once I move to more robust hardware, and
> I am in similar situations, I will post more verbose details from my side.
>
> John
>
>
>
> On Thu, Apr 7, 2016 at 2:29 AM, Abdel Hakim Deneche <adene...@maprtech.com>
> wrote:
>
> > So, we've been seeing some queries hang. I've come up with a possible
> > explanation, but so far it's really difficult to reproduce. Let me know
> > if you think this explanation doesn't hold up or if you have any ideas
> > how we can reproduce it. Thanks
> >
> > - generally it's a CTAS running on a large cluster (lots of writers
> > running in parallel)
> > - logs show that the user channel was closed and UserServer caused the
> > root fragment to move to a FAILED state [1]
> > - jstack shows that the root fragment is blocked in its receiver waiting
> > for data [2]
> > - jstack also shows that ALL other fragments are no longer running, and
> > the logs show that all of them succeeded [3]
> > - the foreman waits *forever* for the root fragment to finish
> >
> > [1] the only case I can think of is when the user channel closed while
> > the fragment was waiting for an ack from the user client
> > [2] if a writer finishes earlier than the others, it will send a data
> > batch to the root fragment that will be sent to the user. The root will
> > then immediately block on its receiver waiting for the remaining writers
> > to finish
> > [3] once the root fragment moves to a failed state, the receiver will
> > immediately release any received batch and return an OK to the sender
> > without putting the batch in its blocking queue.
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
>
