Another question for open discussion amongst the interested and knowledgeable:

Unless anyone has objections, I am going to commit two huge changes for
MAHOUT-151 and MAHOUT-154, as previously discussed. These take several
steps to drive down memory requirements and increase speed a bit. The
result is a massive change, but a net reduction in code and complexity,
and a benefit to performance.

But that investigation has shown me that there is even more overhead
to be trimmed out. The memory requirements for the data have always
been kind of shocking, and that's going to be a lot better shortly,
but I was still astounded to learn this week that an Integer in Java
takes *24 bytes* on a 64-bit JVM. That doesn't count the storage for
the reference to it -- make it 32 bytes, counting at least one 8-byte
reference. That is just nuts for a 4-byte value, just nuts. I had seen this
when profiling but thought it couldn't be right -- seems it is.

So, one idea is to move to, say, long primitives as IDs. Not
Comparable<?>, or Long, but long. That's a big ID range and only 8
bytes per ID compared to 32. But that means Strings don't work as IDs
anymore. That doesn't seem so great. It would be one more tradeoff of
flexibility for performance, but a decent one.
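
To put rough numbers on the 8-versus-32 comparison, here is a
back-of-the-envelope sketch. The per-ID sizes are just the figures
quoted above (24 bytes per Integer plus one 8-byte reference, versus an
8-byte long), not fresh measurements, and the count of 10 million IDs
and the class name IdFootprint are made up for illustration.

```java
public class IdFootprint {
    // Sizes as quoted in this thread for a 64-bit JVM; assumptions, not measurements.
    static final long BOXED_INTEGER_BYTES = 24;  // the Integer object itself
    static final long REFERENCE_BYTES = 8;       // at least one reference to it
    static final long PRIMITIVE_LONG_BYTES = 8;  // a long stored directly

    // Approximate bytes needed to hold 'count' IDs as boxed Integers.
    static long boxedBytes(long count) {
        return count * (BOXED_INTEGER_BYTES + REFERENCE_BYTES);
    }

    // Approximate bytes needed to hold 'count' IDs as primitive longs.
    static long primitiveBytes(long count) {
        return count * PRIMITIVE_LONG_BYTES;
    }

    public static void main(String[] args) {
        long ids = 10_000_000L; // hypothetical data set size
        System.out.println("boxed:     " + boxedBytes(ids) + " bytes");
        System.out.println("primitive: " + primitiveBytes(ids) + " bytes");
    }
}
```

Under those assumptions the boxed form costs four times as much per
ID, before counting any collection overhead on top.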
