The fundamental problem with stint right now is its unfortunate recursive design, which makes every single operation exponentially slower than it has to be in the number of bytes used for the integer.
The first step in any stint work is to replace the implementation with a simple array-based backend. Only then does it make sense to start thinking about anything beyond the most trivial implementations. Getting to that point is something of a priority that we might be looking into soon, and it would likely make the library "good enough" for basic use, including the case pointed out in this thread. In other words, this simple change alone would probably get it to a point where the OP wouldn't have bothered to write a thread about it, even without any compiler intrinsics, assembly code and so on ;)
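To make "simple array-based backend" a bit more concrete, here is a rough sketch of the general idea: a fixed-width integer stored as a flat array of 64-bit limbs, with each operation written as a plain loop over that array. This is not stint's code and not a proposed API. It is in Python purely for illustration, and the names (`from_int`, `to_int`, `add`, `mul`) and the 64-bit limb size are my own assumptions.

```python
# Hypothetical sketch, not stint's implementation: a fixed-width unsigned
# integer stored as a flat list of 64-bit limbs, least significant limb first.

LIMB_BITS = 64                      # assumed limb width
LIMB_MASK = (1 << LIMB_BITS) - 1


def from_int(value, limbs):
    """Split a non-negative int into `limbs` limbs (wraps if it does not fit)."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(limbs)]


def to_int(a):
    """Reassemble a limb array into a Python int (for checking results)."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(a))


def add(a, b):
    """Add two equal-length limb arrays in one pass, propagating the carry.

    Returns (result, carry_out); result wraps modulo 2**(64 * len(a)).
    """
    assert len(a) == len(b)
    out = []
    carry = 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & LIMB_MASK)   # low 64 bits stay in this limb
        carry = s >> LIMB_BITS      # overflow moves to the next limb
    return out, carry


def mul(a, b):
    """Schoolbook multiplication, truncated to the width of the inputs."""
    assert len(a) == len(b)
    n = len(a)
    out = [0] * n
    for i in range(n):
        carry = 0
        for j in range(n - i):      # partial products past limb n-1 are dropped
            t = out[i + j] + a[i] * b[j] + carry
            out[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
    return out
```

With this layout addition is one linear pass and multiplication is the usual quadratic double loop, with no recursion over nested halves. Intrinsics or assembly would only be an optimisation of the loop bodies, not a change to the structure.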