Another question for open discussion amongst the interested and knowledgeable:
Unless anyone objects, I am going to commit two huge changes for MAHOUT-151 and MAHOUT-154, as previously discussed. These take several steps to drive down memory requirements and increase speed a bit. The result is a massive change, but a net reduction in code and complexity, and a benefit to performance.

But that investigation has shown me that there is even more overhead to be trimmed. The memory requirements for the data have always been kind of shocking, and that's going to be a lot better shortly. Still, I was astounded to learn this week that an Integer in Java takes *24 bytes* on a 64-bit JVM. That doesn't count the storage for a reference to it; make it 32 bytes counting at least one reference. That is just nuts for a 4-byte value. I had seen this when profiling but thought it couldn't be right. It seems it is.

So, one idea is to move to, say, long primitives as IDs. Not Comparable<?>, or Long, but long. That's a big ID space and only 8 bytes per ID compared to 32. But that means Strings don't work as IDs anymore, which doesn't seem so great. It would be one more tradeoff of flexibility for performance, but a decent one.
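To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It only multiplies out the per-ID figures quoted above (24 bytes per Integer plus 8 bytes for one reference, versus 8 bytes per primitive long); it does not measure anything. The class name, method names, and the count of 10 million IDs are made up for illustration, and real sizes depend on the JVM and its settings.

```java
public class IdMemoryEstimate {

    // Per-ID figures as quoted in this message for a 64-bit JVM.
    // These are assumptions for the arithmetic, not measured values.
    static final long BYTES_PER_BOXED_INTEGER = 24L;
    static final long BYTES_PER_REFERENCE = 8L;
    static final long BYTES_PER_PRIMITIVE_LONG = 8L;

    /** Estimated bytes for numIds boxed Integer IDs, each held by one reference. */
    public static long boxedIdBytes(long numIds) {
        return numIds * (BYTES_PER_BOXED_INTEGER + BYTES_PER_REFERENCE);
    }

    /** Estimated bytes for numIds IDs stored as primitive longs (e.g. in a long[]). */
    public static long primitiveLongIdBytes(long numIds) {
        return numIds * BYTES_PER_PRIMITIVE_LONG;
    }

    public static void main(String[] args) {
        long numIds = 10000000L; // hypothetical data set size
        long boxed = boxedIdBytes(numIds);
        long primitive = primitiveLongIdBytes(numIds);
        System.out.println("Boxed Integer IDs:  " + boxed + " bytes");
        System.out.println("Primitive long IDs: " + primitive + " bytes");
        System.out.println("Ratio:              " + (boxed / primitive) + "x");
    }
}
```

Under those assumptions the ID storage shrinks by a factor of four, and the ID range grows from 32 bits to 64 bits at the same time.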