FYI and for feedback: As part of Pull Request #938 I added some “spinner” code to the build() method of the UserException class. When this method is called (i.e., just before a failure is reported to the user), that code can go into a spin loop instead of continuing to termination.
This can be useful when investigating the original failure: it allows you to attach a debugger, use jstack to see the stacks at this point of execution, or check some external things (like the condition of the spill files at that point), etc.

To turn this feature ON, create an (empty) flag file named /tmp/drill/spin on every node where this stop-and-spin needs to take place (e.g., use “clush -a touch /tmp/drill/spin” to set it across the whole cluster).

Once a thread hits this code, it checks for the existence of this spin file. If the file exists, the thread creates a temp file named something like /tmp/drill/spin4148663301172491613.tmp, which contains its process ID (e.g., to allow jstack) and the error message, like:

  ~ 5 > cat /tmp/drill/spin5273075865809469794.tmp
  Spinning process: 16966@BBenZvi-E754-MBP13.local
  Error cause: SYSTEM ERROR: CannotPlanException: Node [rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]] could not be implemented; planner state:
  Root: rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]
  . . . . . . .

  ~ 6 > jstack 16966
  Picked up JAVA_TOOL_OPTIONS: -ea
  2017-09-20 17:15:21
  Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode):

  "Attach Listener" #91 daemon prio=9 os_prio=31 tid=0x00007fdd8830b000 nid=0x4f07 waiting on condition [0x0000000000000000]
     java.lang.Thread.State: RUNNABLE

  "263cfbd5-329d-b9fb-d96e-392e4fe0be4d:foreman" #53 daemon prio=10 os_prio=31 tid=0x00007fdd8823a000 nid=0x7203 waiting on condition [0x0000700002224000]
     java.lang.Thread.State: TIMED_WAITING (sleeping)
          at java.lang.Thread.sleep(Native Method)
          at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:570)
  . . . . . . . .

The spinning thread then loops: it sleeps for a second, then rechecks the flag file.

To turn this feature OFF and release the spinning threads, delete the empty spin flag files (e.g., use “clush -a rm /tmp/drill/spin”). This will also clean up the relevant temp files.
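
For anyone who wants to picture the mechanism without opening the PR, here is a minimal standalone sketch of the check-and-spin logic described above. It is not the actual code in UserException.build(); the class name SpinSketch, the method spinWhileFlagged, and the pollMillis parameter are made up for illustration, and the sketch assumes the temp file is removed by the spinning thread itself once the flag file disappears.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not the code from the pull request.
public class SpinSketch {

  /**
   * Blocks the calling thread for as long as flagFile exists.
   * Returns true if the thread actually spun, false if the flag was absent.
   */
  public static boolean spinWhileFlagged(Path flagFile, String errorMessage, long pollMillis)
      throws IOException, InterruptedException {
    if (!Files.exists(flagFile)) {
      return false; // feature is off on this node
    }

    // On HotSpot JVMs this name is typically of the form "pid@hostname".
    String processName = ManagementFactory.getRuntimeMXBean().getName();

    // Create the info file next to the flag file, e.g. <dir>/spin<random>.tmp
    Path info = Files.createTempFile(flagFile.toAbsolutePath().getParent(), "spin", ".tmp");
    try {
      String text = "Spinning process: " + processName + System.lineSeparator()
          + "Error cause: " + errorMessage + System.lineSeparator();
      Files.write(info, text.getBytes(StandardCharsets.UTF_8));

      // Sleep, then recheck the flag; leave the loop once the flag is deleted.
      while (Files.exists(flagFile)) {
        Thread.sleep(pollMillis);
      }
    } finally {
      // Assumption: the spinning thread removes its own info file when released.
      Files.deleteIfExists(info);
    }
    return true;
  }
}
```

With this sketch, a call such as SpinSketch.spinWhileFlagged(Paths.get("/tmp/drill/spin"), message, 1000L) would return immediately when the flag file is absent, and otherwise hold the thread until the flag file is removed.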
I hope this is useful; any feedback or suggestions are welcome.

   Boaz