On 5/22/11 8:40 AM, Aleksandar Dimitrov wrote:
>> If you have too much trouble trying to get SRILM to work, there's
>> also the Berkeley LM which is easier to install. I'm not familiar
>> with its inner workings, but it should offer pretty much the same
>> sorts of operations.
> Do you know how BerkeleyLM compares to, say, MongoDB and PostgreSQL for
> large data sets? Maybe this is also the wrong list to ask this kind of
> question.
Well, BerkeleyLM is specifically for n-gram language modeling; it's not
a general-purpose database. According to the paper I mentioned off-list, the
entire Google Web1T corpus (approx. 1 trillion word tokens, 4 billion
n-gram types) can fit into 10GB of memory, far less than SRILM needs for
the same data.
Databases aren't really my area, so I can't give a good comparison. For
data at this scale, though, you'll want something specialized for storing
n-grams rather than a general-purpose database: there's a lot of redundant
structure in n-gram counts, and you'll want to take advantage of it.
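
Just to make that concrete, here's a rough sketch of storing counts in a
trie keyed on word IDs, so that n-grams sharing a prefix share storage
rather than repeating it. This is only to illustrate the idea, not how
BerkeleyLM actually stores things, and the names (Trie, addNgram,
getCount) are made up:

    import qualified Data.IntMap.Strict as IM

    -- One node per n-gram prefix; counts for "the", "the cat", and
    -- "the cat sat" all hang off the same "the" node.
    data Trie = Trie
      { nodeCount :: !Int            -- count of the n-gram ending here
      , children  :: IM.IntMap Trie
      }

    emptyTrie :: Trie
    emptyTrie = Trie 0 IM.empty

    -- Bump the count for one n-gram, given as a list of word IDs.
    addNgram :: [Int] -> Trie -> Trie
    addNgram []     (Trie c cs) = Trie (c + 1) cs
    addNgram (w:ws) (Trie c cs) =
      Trie c (IM.insertWith (\_ old -> addNgram ws old) w
                            (addNgram ws emptyTrie) cs)

    -- Look up the count of an n-gram (0 if unseen).
    getCount :: [Int] -> Trie -> Int
    getCount []     (Trie c _)  = c
    getCount (w:ws) (Trie _ cs) = maybe 0 (getCount ws) (IM.lookup w cs)

The point is only that the shared-prefix structure falls out for free here,
which a general key-value store won't give you.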
> For regular projects, that integerization would be enough, but for
> your task you'll probably want to spend some time tweaking the
> codes. In particular, you'll probably have enough word types to
> overflow the space of Int32/Word32 or even Int64/Word64.
Again according to Pauls & Klein (2011), Google Web1T has 13.5M word
types, which easily fits into 24 bits. That's for English, so
morphologically rich languages will be different. I wouldn't expect too
many problems for German, unless you have a lot of technical text with a
prodigious number of unique compound nouns. Even then I'd be surprised
if you went over 2^24 (that'd be reserved for languages like Japanese,
Hungarian, Inuit,... if even they'd ever get that bad).
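
And for what it's worth, the integerization step itself is cheap to roll by
hand. A minimal sketch using unordered-containers (the names Lexicon and
intern are just mine, and this ignores concurrency and on-disk persistence
entirely):

    import qualified Data.HashMap.Strict as HM
    import qualified Data.Text as T

    -- Map each word type to a small, densely allocated Int ID.
    data Lexicon = Lexicon
      { wordToId :: HM.HashMap T.Text Int
      , nextId   :: !Int
      }

    emptyLexicon :: Lexicon
    emptyLexicon = Lexicon HM.empty 0

    -- Return the existing ID for a word, or allocate the next free one.
    intern :: T.Text -> Lexicon -> (Int, Lexicon)
    intern w lex@(Lexicon m n) =
      case HM.lookup w m of
        Just i  -> (i, lex)
        Nothing -> (n, Lexicon (HM.insert w n m) (n + 1))

With word-type counts in the tens of millions, the IDs this hands out stay
far below the Int64/Word64 range, and plain Int is more than enough.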
--
Live well,
~wren