http://www.matasano.com/log/628/rafal-wojtczuks-user-mode-single-stepping-100x-faster-than-debuggers/

Rafal Wojtczuk’s User-Mode Single Stepping: 100x Faster Than Debuggers

Thomas Ptacek | November 30th, 2006 | Filed Under: Bitching About Protocols, Reversing, Uncategorized

‘Tis the season, apparently. Cool new project and excellent blog post from libnids author and Eastern European reversing psychopath Rafal Wojtczuk, now at McAfee’s AVERT Labs. He’s announced UMSS, the User Mode Single-Stepper, a tool for tracing the execution of Win32 binaries.

Refresher: single-stepping stops a program after each individual CPU instruction, usually so you can record it. It’s normally done with a debugger; on Intel x86, the debugger sets the “trap flag” (TF, bit 8 of EFLAGS), which tells the CPU to raise a debug exception after each instruction.

[Figure: dbgss.png]
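For the record, the trap flag is nothing more exotic than one bit in the saved flags register. Here is a minimal sketch of the bit-twiddling a debugger does before it resumes the target. The function names are made up for illustration, and the actual reading and writing of the thread's register context (which is OS-specific) is left out.

```python
# x86 EFLAGS: the trap flag (TF) is bit 8.
TRAP_FLAG = 1 << 8  # 0x100


def arm_single_step(eflags: int) -> int:
    """Return the flags value with TF set, so the CPU traps after one instruction.

    Hypothetical helper: a real debugger applies this to the thread's saved
    register context and then resumes the thread.
    """
    return eflags | TRAP_FLAG


def disarm_single_step(eflags: int) -> int:
    """Return the flags value with TF cleared, so the thread runs freely again."""
    return eflags & ~TRAP_FLAG


if __name__ == "__main__":
    flags = 0x202                          # an example saved flags value
    print(hex(arm_single_step(flags)))     # same value with bit 8 turned on
```

Setting the bit is cheap. What costs you is everything that happens once the exception fires.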

The problem with debugger single-stepping is that each instruction traps to the kernel, which transfers control to another process (the debugger), which then calls back into the kernel to find out what happened. A single user/kernel (u/k) transition is expensive: network programmers, who execute thousands of instructions between I/O operations, still try to minimize them. Debugger single-stepping involves multiple u/k transitions and context switches per instruction. It’s just nightmarishly slow.

Rafal’s project speeds this up by two orders of magnitude by single-stepping entirely in userland. He does it by continuously rewriting the “next” instruction on the fly so that it transfers control to a handler function.

[Figure: umss.png]
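To make the rewriting idea concrete, here is a toy model in Python that patches a byte buffer instead of a live process. It assumes the patch is a 5-byte `jmp rel32` (opcode 0xE9); I don't know which patch UMSS actually writes, and the `Stepper` class and its method names are invented for this sketch.

```python
import struct

JMP_REL32 = 0xE9   # x86 opcode for "jmp rel32"
PATCH_LEN = 5      # one opcode byte plus a 4-byte displacement


def encode_jmp(src: int, dst: int) -> bytes:
    """Encode a jmp to dst, placed at address src.

    The displacement is relative to the end of the 5-byte jump.
    """
    return bytes([JMP_REL32]) + struct.pack("<i", dst - (src + PATCH_LEN))


class Stepper:
    """Hypothetical model: 'memory' stands in for the traced program's code."""

    def __init__(self, memory: bytearray, base: int, handler: int):
        self.memory = memory      # code bytes, loaded at address 'base'
        self.base = base
        self.handler = handler    # address of the handler function
        self.patched_at = None
        self.saved = b""

    def arm(self, next_addr: int) -> None:
        """Save the bytes at next_addr and overwrite them with a jump to the handler."""
        off = next_addr - self.base
        self.saved = bytes(self.memory[off:off + PATCH_LEN])
        self.memory[off:off + PATCH_LEN] = encode_jmp(next_addr, self.handler)
        self.patched_at = next_addr

    def disarm(self) -> None:
        """Put the original bytes back so the real instruction can execute."""
        if self.patched_at is None:
            return
        off = self.patched_at - self.base
        self.memory[off:off + PATCH_LEN] = self.saved
        self.patched_at = None
```

The loop is then: the handler fires, records the step, calls `disarm()`, works out where the instruction after this one will land, calls `arm()` on that address, and lets the program continue. No kernel involved.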

This is similar to what Detours does, in that Rafal is swapping out instructions for jumps to a handler. But Detours only instruments the prologue of each function; UMSS instruments every instruction, on the fly. This is tricky, because to do that for each instruction, you have to know where the next instruction is. It’s not always “the next instruction in memory”, because of jumps. It’s not always “the target of a jump”, because jumps are conditional. It’s not always even possible to look at an instruction and know the jump target, because jumps can be indirect, going through registers.

UMSS solves this problem in two ways:

  1. it uses an embedded disassembler to decode jumps with static targets, and peeks at the condition flags to figure out whether jumps will be taken.

  2. it has a simple and clever fallback for indirect jumps: just switch back to kernel-assisted debugging for that instruction. The overwhelming majority of the instruction stream doesn’t need it, so you still get the huge speedup.
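Both cases can be sketched in a few lines. What follows is not UMSS’s disassembler, just a toy that understands a handful of 32-bit x86 opcodes (short Jcc, JMP rel8/rel32, NOP, and near indirect CALL/JMP through 0xFF) with no prefixes; the function names are mine. It returns `None` where the target can’t be known statically, which is the cue to fall back to the trap flag.

```python
import struct

# EFLAGS bits consulted by conditional jumps.
CF, PF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 6, 1 << 7, 1 << 11


def condition_holds(cc: int, eflags: int) -> bool:
    """Evaluate the 4-bit x86 condition code cc against eflags."""
    cf, pf, zf, sf, of = (bool(eflags & m) for m in (CF, PF, ZF, SF, OF))
    base = [
        of,                  # JO   (odd code: JNO)
        cf,                  # JB   (JAE)
        zf,                  # JE   (JNE)
        cf or zf,            # JBE  (JA)
        sf,                  # JS   (JNS)
        pf,                  # JP   (JNP)
        sf != of,            # JL   (JGE)
        zf or (sf != of),    # JLE  (JG)
    ][cc >> 1]
    return base != bool(cc & 1)   # odd condition codes are the negations


def next_ip(code: bytes, addr: int, eflags: int):
    """Predict the address executed after the instruction at addr.

    'code' holds the instruction bytes starting at addr. Returns None when
    the target depends on a register or memory operand.
    """
    op = code[0]
    if 0x70 <= op <= 0x7F:                       # Jcc rel8
        rel = struct.unpack("<b", code[1:2])[0]
        fall_through = addr + 2
        taken = condition_holds(op & 0xF, eflags)
        return fall_through + rel if taken else fall_through
    if op == 0xEB:                               # JMP rel8
        return addr + 2 + struct.unpack("<b", code[1:2])[0]
    if op == 0xE9:                               # JMP rel32
        return addr + 5 + struct.unpack("<i", code[1:5])[0]
    if op == 0xFF and ((code[1] >> 3) & 7) in (2, 4):
        return None                              # CALL/JMP r/m32: indirect
    if op == 0x90:                               # NOP
        return addr + 1
    raise NotImplementedError(f"opcode {op:#04x} is outside this sketch")
```

A real tracer needs a full instruction-length decoder for everything else, which is why UMSS embeds a disassembler.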

Why is this stuff important? To be honest, I don’t know. The “state of the art” in tracing programs right now is in instrumenting basic blocks, which are the ~10-20 instruction chunks that functions are composed of. For reversing purposes, this level of detail is usually more than enough. Clearly for malware research, where code is deliberately designed to be unclear, instruction-by-instruction detail is critical. I’d love for someone to tell me how I could exploit fast single-stepping to get a different project done.

The bigger story is the apparent renaissance we’re experiencing in binary program manipulation. 7 years ago, technology like Detours, PaiMei, and UMSS would have been the closely-guarded crown jewels of security companies. Now they’re free side-projects.

