It looks like you will have quite a few “combinatoric explosions” to cope with. In addition to the 1.5M words, you have bigrams and trigrams – combinations of two and three words respectively. You need to get a handle on the cardinality of each of your tables; bigrams and trigrams could give you who knows how many millions more rows than the 1.5M word-frequency rows.
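To make the cardinality question concrete, here is a minimal sketch (in Python, on a toy corpus – your real pipeline and numbers will differ) of counting words, bigrams, and trigrams in memory with a Counter, which also shows the write-once alternative to per-occurrence updates, plus the back-of-envelope bytes-per-row arithmetic:

```python
from collections import Counter

def ngrams(words, n):
    """Yield every n-word sequence from a token list."""
    return zip(*(words[i:] for i in range(n)))

# Toy corpus standing in for the real one (hypothetical data).
corpus = "the cat sat on the mat and the cat slept".split()

# Count everything in memory; each unique key would then be
# written to Cassandra exactly once, instead of one update
# per occurrence.
word_counts = Counter(corpus)
bigram_counts = Counter(ngrams(corpus, 2))
trigram_counts = Counter(ngrams(corpus, 3))

print(len(word_counts))     # distinct words in the toy corpus
print(len(bigram_counts))   # distinct bigrams
print(len(trigram_counts))  # distinct trigrams

# Back-of-envelope storage check: rows vs. on-disk bytes.
total_rows = 15_000_000          # e.g. a 10-to-1 explosion of 1.5M words
disk_bytes = 15 * 1024**3        # 15 GB on disk
print(disk_bytes // total_rows)  # ~1,000 bytes per row
```

Running the same counts over the full corpus (and per year) before sizing the tables would tell you whether the explosion is closer to 10-to-1 or 30-to-1.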
And then you have word, bigram, and trigram frequencies by year as well, meaning you take the counts from above and multiply by the number of years in your corpus! And then you have word, bigram, and trigram “usage” – by year as well. Is that every unique sentence from the corpus? Either way, this is an incredible combinatoric explosion. And then there are category and position, which I didn’t look at since you didn’t specify what exactly they are.

Once again, start with a focus on the cardinality of the data. In short, just as a thought experiment, say that your 1.5M words expanded into 15M rows; divide that into 15 GB and that gives you 1,000 bytes per row, which may be a bit more than desired, but not totally unreasonable. And maybe the explosion is more like 30 to 1, which would give about 333 bytes per row, which seems quite reasonable.

Also, are you doing heavy updates – one for each word (and bigram and trigram) as each occurrence is encountered in the corpus – or are you counting things in memory and then writing each row only once after the full corpus has been read?

Also, what is the corpus size – total word instances, both for the full corpus and for the subset containing your 1.5 million words?

-- Jack Krupansky

From: Chamila Wijayarathna
Sent: Sunday, December 14, 2014 7:01 AM
To: user@cassandra.apache.org
Subject: Cassandra Database using too much space

Hello all,

We are trying to develop a language corpus using Cassandra as its storage medium. https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the types of information we need to extract from the corpus interface, so we designed the schema at https://gist.github.com/cdwijayarathna/6491122063152669839f to use as the database.

Our target is to develop a corpus with 100+ million words. By now we have inserted about 1.5 million words and the database has used about 14 GB of space. Is this a normal scenario, or are we doing something wrong? Is there any issue in our data model?

Thank you!
-- Chamila Dilshan Wijayarathna, SMIEEE, SMIESL, Undergraduate, Department of Computer Science and Engineering, University of Moratuwa.