A Lucene index, with storage turned off and positions etc. optionally turned off, will be very efficient. Plus, there is virtually no code to write. I've seen bare-bones indexes come in at as little as 20% of the original data size, with very fast lookup. Furthermore, there are many options available for controlling how much is loaded into memory, etc. Finally, it will handle all the languages you throw at it.
-Grant

On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:

> Currently Java strings use double the space of the characters in them because
> it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when
> loaded into a HashMap<String, Integer>. Is there some optimization we could
> do in terms of storing them while ensuring that Chinese, Devanagari and other
> characters don't get mangled in the process?
>
> Some options Benson suggested were: storing just the byte[] form and adding
> the option of supplying the hash function in OpenObjectIntHashMap, or
> even using a UTF-8 string.
>
> Or we could leave this alone. I currently estimate the memory requirement
> using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when
> generating the dictionary split for the vectorizer.
>
> Robin
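For what it's worth, the byte[] idea can be sketched in plain Java; the class and method names below are illustrative, not Mahout API. It shows that a UTF-8 round trip preserves Chinese and Devanagari text, and it computes the per-string estimate formula from the thread. One caveat worth noting: UTF-8 shrinks Latin-heavy dictionaries, but CJK and Devanagari characters take 3 bytes each in UTF-8 versus 2 in UTF-16, so the savings depend on the corpus.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch (not Mahout code): store dictionary terms as UTF-8
// byte[] instead of java.lang.String, and estimate String memory use with
// the formula from the thread.
public class Utf8DictionarySketch {

  // Rough per-String memory estimate from the thread:
  // 8 * ((int) (num_chars * 2 + 45) / 8) bytes.
  static long estimateStringBytes(int numChars) {
    return 8L * ((numChars * 2 + 45) / 8);
  }

  public static void main(String[] args) {
    // English, Chinese, and Devanagari terms.
    String[] terms = {"hello", "中文词典", "शब्दकोश"};

    for (String term : terms) {
      // Encode to UTF-8 bytes and decode back; the round trip is lossless
      // for any valid Unicode string, so nothing gets mangled.
      byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
      String decoded = new String(utf8, StandardCharsets.UTF_8);
      if (!decoded.equals(term)) {
        throw new AssertionError("lossy round trip for " + term);
      }
      System.out.printf("%s: %d chars, %d UTF-8 bytes, ~%d bytes as String%n",
          term, term.length(), utf8.length,
          estimateStringBytes(term.length()));
    }
  }
}
```

Note that raw byte[] keys do not work directly in a HashMap (arrays use identity equals/hashCode), which is why Robin mentions supplying a hash function to the map, or wrapping the bytes in a small value class.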