A Lucene index, with storage, positions, etc. (optionally) turned off, will be 
very efficient.  Plus, there is virtually no code to write.  I've seen bare-bones 
indexes be as little as 20% of the size of the original data, with very fast 
lookup.  Furthermore, there are many options available for controlling how much 
is loaded into memory.  Finally, it will handle all the languages you throw 
at it.

-Grant

On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:

> Currently Java strings use double the space of the characters in them because
> they are stored as UTF-16. A 190MB dictionary file therefore uses around 600MB when
> loaded into a HashMap<String, Integer>.  Is there some optimization we could
> do in terms of storing them while ensuring that Chinese, Devanagari and other
> characters don't get messed up in the process.
> 
> Some options Benson suggested were: storing just the byte[] form and adding
> the option of supplying the hash function in OpenObjectIntHashmap, or
> even using a UTF-8 string.
> 
> Or we could leave this alone. I currently estimate the memory requirement
> for Strings using the formula 8 * ( (int) (num_chars * 2 + 45) / 8 ) when
> generating the dictionary split for the vectorizer.
> 
> Robin
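Benson's byte[] idea can be sketched in plain Java (the class and method names below are made up for illustration, not from Mahout): UTF-8 stores ASCII-heavy terms in roughly half the bytes of UTF-16 `char` data, round-trips Chinese and Devanagari text losslessly, and Robin's per-String estimate can be checked directly:

```java
import java.nio.charset.StandardCharsets;

public class DictMemorySketch {

    // Robin's per-String heap estimate: 8 * ((num_chars * 2 + 45) / 8) bytes
    // (2 bytes per char plus ~45 bytes of object/array overhead, rounded
    // down to an 8-byte boundary).
    static long estimatedStringBytes(int numChars) {
        return 8L * ((numChars * 2 + 45) / 8);
    }

    public static void main(String[] args) {
        // A 10-char ASCII term costs ~64 bytes as a String under the estimate...
        System.out.println(estimatedStringBytes(10));   // 64

        // ...but only 10 bytes of character payload as UTF-8.
        byte[] utf8 = "dictionary".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                // 10

        // UTF-8 encode/decode is lossless for any Unicode text, so Chinese
        // and Devanagari terms are not mangled by storing byte[] instead:
        String mixed = "词典 शब्दकोश";
        byte[] bytes = mixed.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(mixed)); // true
    }
}
```

One caveat with raw byte[] keys: Java arrays use identity-based hashCode/equals, so a map keyed on byte[] needs a content-based hash and equality — which is presumably why supplying a custom hash function to the hash map was on the list of options.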
