It looks like you will have quite a few “combinatoric explosions” to cope with.
In addition to the 1.5M words, you have bigrams and trigrams – combinations of
two and three words. You need to get a handle on the cardinality of each of
your tables. Bigrams and trigrams could give you who knows how many millions
more rows than the 1.5M word-frequency rows.
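
As a rough way to get those numbers before worrying about the schema, a quick
Python pass over a sample of the corpus would give you the distinct
unigram/bigram/trigram counts plus total tokens. This is just a sketch – the
file name is made up and the tokenization is naive whitespace splitting, so
swap in whatever your real pipeline does:

# Rough cardinality check over a corpus sample (hypothetical file name).
from collections import Counter

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
total_tokens = 0

with open("corpus_sample.txt", encoding="utf-8") as f:
    for line in f:
        tokens = line.split()          # naive whitespace tokenization
        total_tokens += len(tokens)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))

print("total tokens:     ", total_tokens)
print("distinct words:   ", len(unigrams))
print("distinct bigrams: ", len(bigrams))
print("distinct trigrams:", len(trigrams))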

And then you have word, bigram, and trigram frequencies by year as well,
meaning you take the counts from above and multiply by the number of years in
your corpus!
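
Just to make that multiplication concrete: the per-year tables key on
(year, n-gram) rather than the n-gram alone, so distinct rows scale with the
number of years. A toy sketch (where the year comes from is assumed – in
reality it would be document metadata):

from collections import Counter

yearly_counts = Counter()
# Toy data: the same three words appear in two different years.
for year, sentence in [(1990, "the cat sat"), (1991, "the cat sat")]:
    for token in sentence.split():
        yearly_counts[(year, token)] += 1

print(len(yearly_counts), "distinct (year, word) rows")   # 6 rows, not 3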

And then you have word, bigram, and trigram “usage” – and by year as well. Is
that every unique sentence from the corpus? Either way, this is an incredible
combinatoric explosion.

And then there are category and position, which I didn’t look at since you
didn’t specify what exactly they are. Once again, start with a focus on the
cardinality of the data.

In short, just as a thought experiment, say that your 1.5M words expanded into
15M rows; divide that into roughly 15 Gbytes and that would give you 1,000
bytes per row, which may be a bit more than desired, but not totally
unreasonable. And maybe the explosion is more like 30 to 1 (45M rows), which
would give about 333 bytes per row, which seems quite reasonable.
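
Spelled out (using your 14 GB rounded up to 15 GB; the 10x and 30x expansion
factors are just guesses on my part):

# Back-of-the-envelope bytes-per-row estimate.
db_bytes = 15_000_000_000        # ~14 GB on disk, rounded up to 15 GB
words = 1_500_000

for expansion in (10, 30):       # guessed expansion factors
    rows = words * expansion
    print(f"{expansion}x -> {rows:,} rows, ~{db_bytes / rows:.0f} bytes/row")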

Also, are you doing heavy updates – updating each word (and bigram and
trigram) count as each occurrence is encountered in the corpus – or are you
counting things in memory and then writing each row only once after the full
corpus has been read?
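
To illustrate the count-in-memory-then-write-once approach, here is a rough
sketch using the DataStax Python driver. The file, keyspace, table, and column
names are made up since I haven’t dug into your schema gist, and for 100+
million rows you would want batched or asynchronous writes rather than one
synchronous insert per row:

from collections import Counter
from cassandra.cluster import Cluster   # DataStax Python driver

# 1. Count entirely in memory.
word_counts = Counter()
with open("corpus_sample.txt", encoding="utf-8") as f:   # hypothetical file
    for line in f:
        word_counts.update(line.split())

# 2. Write each row exactly once.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("corpus")                      # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO word_frequency (word, frequency) VALUES (?, ?)")
for word, count in word_counts.items():
    session.execute(insert, (word, count))
cluster.shutdown()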

Also, what is the corpus size – total word instances, both for the full corpus 
and for the subset containing your 1.5 million words?

-- Jack Krupansky

From: Chamila Wijayarathna 
Sent: Sunday, December 14, 2014 7:01 AM
To: user@cassandra.apache.org 
Subject: Cassandra Database using too much space

Hello all, 

We are trying to develop a language corpus by using Cassandra as its storage 
medium.

https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the types of 
information we need to extract through the corpus interface. 

So we designed the schema at 
https://gist.github.com/cdwijayarathna/6491122063152669839f to use as the 
database. Our target is to develop a corpus with 100+ million words.

By now we have inserted about 1.5 million words and the database has used 
about 14 GB of space. Is this a normal scenario or are we doing something 
wrong? Is there any issue in our data model?

Thank You!
-- 

Chamila Dilshan Wijayarathna,
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.
