While I egged Robin on to some extent on this topic by IM, I should point out the following.
We run large amounts of text through Java at Basis, and we always use String. I have an 8G laptop :-), but there you have it. Anything we do in English we do shortly afterwards in Arabic (UTF-8 = UTF-16) and Hanzi (UTF-8 > UTF-16), so it doesn't make sense for us to optimize this. Obviously, compression is an option in various ways, and we could imagine some magic containers that optimize string storage one way or the other.

On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil <robin.a...@gmail.com> wrote:
> Currently Java strings use double the space of the characters in them because
> it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when
> loaded into a HashMap<String, Integer>. Is there some optimization we could
> do in terms of storing them, while ensuring that Chinese, Devanagari and other
> characters don't get messed up in the process?
>
> Some options Benson suggested were: storing just the byte[] form and adding
> the option of supplying the hash function in OpenObjectIntHashMap, or even
> using a UTF-8 string.
>
> Or we could leave this alone. I currently estimate the memory requirement for
> strings using the formula 8 * ((int) (num_chars * 2 + 45) / 8) when generating
> the dictionary split for the vectorizer.
>
> Robin
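
For what it's worth, here is a minimal sketch of the byte[] idea Robin mentions (nothing from Mahout; the Utf8Term and Utf8DictionaryDemo names are made up for illustration): keep each dictionary term as its UTF-8 bytes and hash/compare those bytes directly, so an ASCII-heavy dictionary sits in roughly half the memory of char-backed Strings, while Chinese or Devanagari text still round-trips intact.

import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type: a term stored as UTF-8 bytes instead of a String.
final class Utf8Term {
  private static final Charset UTF8 = Charset.forName("UTF-8");
  private final byte[] bytes;  // UTF-8 encoded term
  private final int hash;      // hash cached over the raw bytes

  Utf8Term(String term) {
    this.bytes = term.getBytes(UTF8);
    this.hash = Arrays.hashCode(bytes);
  }

  @Override
  public int hashCode() {
    return hash;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof Utf8Term && Arrays.equals(bytes, ((Utf8Term) o).bytes);
  }

  @Override
  public String toString() {
    return new String(bytes, UTF8);  // decode back to chars only when needed
  }
}

public class Utf8DictionaryDemo {
  public static void main(String[] args) {
    Map<Utf8Term, Integer> dictionary = new HashMap<Utf8Term, Integer>();
    dictionary.put(new Utf8Term("hello"), Integer.valueOf(0));
    dictionary.put(new Utf8Term("\u4e16\u754c"), Integer.valueOf(1)); // Hanzi survives the round trip
    System.out.println(dictionary.get(new Utf8Term("hello")));        // prints 0
  }
}

Supplying a custom hash function to OpenObjectIntHashMap would amount to the same hashing-over-bytes trick; the trade-off in either case is paying for a UTF-8 decode whenever you actually need the characters back.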