Hi Tim,
Using this "flag file" gives finer control than a System Option (e.g., it can be set on a single node). Also, with a system option, one needs to start another session to turn the option OFF. (Would the looping thread then see the option change, or would it be using a cached value?)

This feature is for use by developers (or maybe support), so whichever is easier for us ....

   Boaz

________________________________
From: Timothy Farkas <tfar...@mapr.com>
Sent: Thursday, September 21, 2017 6:31:22 PM
To: dev@drill.apache.org
Subject: Re: Added "spinner" code to allow debugging of failure cause

Hi Boaz,

Would it be possible to implement this as a System option, so that there can be a uniform way for toggling these features?

Thanks,
Tim

________________________________
From: Boaz Ben-Zvi <bben-...@mapr.com>
Sent: Wednesday, September 20, 2017 5:23:43 PM
To: dev@drill.apache.org
Subject: Added "spinner" code to allow debugging of failure cause

FYI and for feedback:

As part of Pull Request #938 I added "spinner" code to the build() method of the UserException class, such that when this method is called (i.e., before a failure is reported to the user), the code can go into a spin loop (instead of continuing to termination). This can be useful when investigating the original failure: it allows attaching a debugger, using jstack to see the stacks at this point of execution, checking external state (such as the condition of the spill files at that point), etc.

To turn this feature ON, create an empty flag file named /tmp/drill/spin on every node where the stop-and-spin needs to take place (e.g., use "clush -a touch /tmp/drill/spin" to set it across the whole cluster).
Once a thread hits this code, it checks for the existence of this spin file, and if it exists, the thread creates a temp file named something like /tmp/drill/spin4148663301172491613.tmp, which contains its process ID (e.g., to allow jstack) and the error message, like:

  ~ 5 > cat /tmp/drill/spin5273075865809469794.tmp
  Spinning process: 16966@BBenZvi-E754-MBP13.local
  Error cause: SYSTEM ERROR: CannotPlanException: Node [rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]] could not be implemented; planner state:

  Root: rel#232:Subset#10.PHYSICAL.SINGLETON([]).[]
  . . . . . . .
  ~ 6 > jstack 16966
  Picked up JAVA_TOOL_OPTIONS: -ea
  2017-09-20 17:15:21
  Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode):

  "Attach Listener" #91 daemon prio=9 os_prio=31 tid=0x00007fdd8830b000 nid=0x4f07 waiting on condition [0x0000000000000000]
     java.lang.Thread.State: RUNNABLE

  "263cfbd5-329d-b9fb-d96e-392e4fe0be4d:foreman" #53 daemon prio=10 os_prio=31 tid=0x00007fdd8823a000 nid=0x7203 waiting on condition [0x0000700002224000]
     java.lang.Thread.State: TIMED_WAITING (sleeping)
          at java.lang.Thread.sleep(Native Method)
          at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:570)
  . . . . . . . .

The spinning thread then loops: it sleeps for a second and then rechecks the flag file.

To turn this feature OFF and release the spinning threads, delete the empty spin file on each node (e.g., use "clush -a rm /tmp/drill/spin"). This will also clean up the relevant temp files.

Hope this is useful; any feedback or suggestions are welcome.

   Boaz
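
For readers who want to see the mechanism in one place, here is a minimal Java sketch of the flag-file spin described in the original message. It is not the code from the pull request: the class name SpinSketch, the method spinIfFlagged, and its parameters are hypothetical, and using RuntimeMXBean.getName() to obtain a "pid@host" string is an assumption (that format is typical on HotSpot but not guaranteed by the API).

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the flag-file spin; not the actual Drill code.
public class SpinSketch {

  /**
   * If flagFile exists, writes a temp info file next to it and sleeps in a loop
   * until the flag file is removed. Returns true if the thread spun.
   */
  public static boolean spinIfFlagged(Path flagFile, String errorMessage, long sleepMillis)
      throws IOException, InterruptedException {
    if (!Files.exists(flagFile)) {
      return false; // feature is OFF on this node
    }
    Path dir = flagFile.toAbsolutePath().getParent();
    // Produces a name like spin<random digits>.tmp in the flag file's directory.
    Path info = Files.createTempFile(dir, "spin", ".tmp");
    try {
      // Assumption: on HotSpot the runtime name usually looks like "pid@hostname".
      String processName = ManagementFactory.getRuntimeMXBean().getName();
      String text = "Spinning process: " + processName + "\n"
          + "Error cause: " + errorMessage + "\n";
      Files.write(info, text.getBytes(StandardCharsets.UTF_8));
      // Spin: sleep, then recheck the flag file.
      while (Files.exists(flagFile)) {
        Thread.sleep(sleepMillis);
      }
    } finally {
      // Clean up the info file once the thread is released.
      Files.deleteIfExists(info);
    }
    return true;
  }
}
```

In this sketch a call such as SpinSketch.spinIfFlagged(Paths.get("/tmp/drill/spin"), message, 1000L) would correspond to the one-second recheck described above. Checking a file on every pass, rather than a value read once, is what lets deleting the file release a thread that is already spinning.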