While I egged Robin on to some extent on this topic by IM, I should point out the following.
We run large amounts of text through Java at Basis, and we always use String. I have an 8G laptop :-), but there you have it. Anything we do in English we do shortly afterwards in Arabic (UTF-8 = UTF-16) and Hanzi (UTF-8 > UTF-16), so it doesn't make sense for us to optimize this. Obviously, compression is an option in various ways, and we could imagine some magic containers that optimize string storage one way or the other.

On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil <robin.a...@gmail.com> wrote:
> Currently Java strings use double the space of the characters in them because
> it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when
> loaded into a HashMap<String, Integer>. Is there some optimization we could
> do in terms of storing them, while ensuring that Chinese, Devanagari and other
> characters don't get messed up in the process?
>
> Some options Benson suggested were: storing just the byte[] form and adding
> the option of supplying the hash function in OpenObjectIntHashMap, or even
> using a UTF-8 string.
>
> Or we could leave this alone. I currently estimate the memory requirement for
> strings using the formula 8 * ((int) (num_chars * 2 + 45) / 8) when generating
> the dictionary split for the vectorizer.
>
> Robin
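
For what it's worth, here is a minimal sketch of the byte[] idea Robin mentions (nothing from Mahout; the Utf8Term and Utf8DictionaryDemo names are made up for illustration): keep each dictionary term as its UTF-8 bytes and hash/compare those bytes directly, so an ASCII-heavy dictionary sits in roughly half the memory of char-backed Strings, while Chinese or Devanagari text still round-trips intact.

import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type: a term stored as UTF-8 bytes instead of a String.
final class Utf8Term {
  private static final Charset UTF8 = Charset.forName("UTF-8");
  private final byte[] bytes;  // UTF-8 encoded term
  private final int hash;      // hash cached over the raw bytes

  Utf8Term(String term) {
    this.bytes = term.getBytes(UTF8);
    this.hash = Arrays.hashCode(bytes);
  }

  @Override
  public int hashCode() {
    return hash;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof Utf8Term && Arrays.equals(bytes, ((Utf8Term) o).bytes);
  }

  @Override
  public String toString() {
    return new String(bytes, UTF8);  // decode back to chars only when needed
  }
}

public class Utf8DictionaryDemo {
  public static void main(String[] args) {
    Map<Utf8Term, Integer> dictionary = new HashMap<Utf8Term, Integer>();
    dictionary.put(new Utf8Term("hello"), Integer.valueOf(0));
    dictionary.put(new Utf8Term("\u4e16\u754c"), Integer.valueOf(1)); // Hanzi survives the round trip
    System.out.println(dictionary.get(new Utf8Term("hello")));        // prints 0
  }
}

Supplying a custom hash function to OpenObjectIntHashMap would amount to the same hashing-over-bytes trick; the trade-off in either case is paying for a UTF-8 decode whenever you actually need the characters back.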