FYI and for feedback:

  As part of Pull Request #938 I added “spinner” code to the build() method 
of the UserException.Builder class, so that when this method is called (i.e., 
just before a failure is reported to the user), the calling thread can go into 
a spin loop (instead of continuing to termination).

This can be useful when investigating the original failure: it allows you to 
attach a debugger, use jstack to see the stacks at this point of execution, or 
check external state (such as the condition of the spill files at that point), 
etc.

To turn this feature ON, create an empty flag file named /tmp/drill/spin on 
every node where the spinning should take place (e.g., use 
“clush -a touch /tmp/drill/spin” to set it across the whole cluster). When a 
thread hits this code, it checks for the existence of this spin file, and if it 
exists, the thread creates a temp file named something like 
/tmp/drill/spin4148663301172491613.tmp, which contains its process ID (e.g., to 
allow jstack) and the error message, like:

~ 5 > cat /tmp/drill/spin5273075865809469794.tmp
Spinning process: 16966@BBenZvi-E754-MBP13.local
Error cause: SYSTEM ERROR: CannotPlanException: Node 
[rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]] could not be implemented; planner 
state:

Root: rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]
. . . . . . .

~ 6 > jstack 16966
Picked up JAVA_TOOL_OPTIONS: -ea
2017-09-20 17:15:21
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode):

"Attach Listener" #91 daemon prio=9 os_prio=31 tid=0x00007fdd8830b000 
nid=0x4f07 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"263cfbd5-329d-b9fb-d96e-392e4fe0be4d:foreman" #53 daemon prio=10 os_prio=31 
tid=0x00007fdd8823a000 nid=0x7203 waiting on condition [0x0000700002224000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
     at java.lang.Thread.sleep(Native Method)
     at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:570)
. . . . . . . .

The spinning thread then loops: it sleeps for a second and then rechecks the 
flag file. To turn this feature OFF and release the spinning threads, delete 
the empty spin file (e.g., use “clush -a rm /tmp/drill/spin”). This will also 
clean up the relevant temp files.
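
For readers who want to see the mechanism in one place, here is a minimal sketch of the check / record / sleep / recheck cycle described above. It is not the code from the pull request: the class name SpinSketch, the method spinWhileFlagExists and its parameters are made up for illustration, and it assumes the JVM reports its runtime name as "pid@hostname" (as in the sample output above).

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not the actual Drill implementation.
public class SpinSketch {

  /**
   * If flagFile exists, writes a temp file next to it describing the
   * spinning process and the error, then sleeps in a loop until flagFile
   * is deleted. Returns true if the thread spun, false if the flag was absent.
   */
  public static boolean spinWhileFlagExists(Path flagFile, String errorMessage, long pollMillis)
      throws IOException, InterruptedException {
    if (!Files.exists(flagFile)) {
      return false; // feature is OFF: carry on with normal error reporting
    }
    // Assumption: the runtime name looks like "16966@hostname" on this JVM.
    String processName = ManagementFactory.getRuntimeMXBean().getName();
    Path dir = flagFile.toAbsolutePath().getParent();
    // Produces a name like spin4148663301172491613.tmp in the flag's directory.
    Path info = Files.createTempFile(dir, "spin", ".tmp");
    try {
      String text = "Spinning process: " + processName + System.lineSeparator()
          + "Error cause: " + errorMessage + System.lineSeparator();
      Files.write(info, text.getBytes(StandardCharsets.UTF_8));
      // Sleep, then recheck the flag; leave the loop once the flag is gone.
      while (Files.exists(flagFile)) {
        Thread.sleep(pollMillis);
      }
    } finally {
      Files.deleteIfExists(info); // clean up this thread's temp file
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    // The flag path defaults to the one used in this mail.
    Path flag = Paths.get(args.length > 0 ? args[0] : "/tmp/drill/spin");
    boolean spun = spinWhileFlagExists(flag, "example error", 1000L);
    System.out.println(spun ? "released after spinning" : "flag file absent, did not spin");
  }
}
```

In this sketch each spinning thread removes its own temp file once the flag disappears, which is one way to get the cleanup behavior mentioned above; the real code may do it differently.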

   Hope this is useful, and welcome any feedback or suggestions.

      Boaz
