I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking individual songs into lines, which are then split into words, so you end up with something like:
    {0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"}
     1 {0 "the" 1 "old" 2 "grey" 3 "goose" 4 "is" 5 "dead"}
     ...}

(Yes, maps with integer keys are kind of dumb; I thought about using vectors, but this is all going into MongoDB temporarily, and I'd rather just deal with maps than mess with Mongo's somewhat lacking array handling.)

The idea, ultimately, is to build a front end that would let users, e.g., search for all songs that contain the (sub)string "aunt rhodie", or see how many times the Rolling Stones use the word "woman" vs. how many times the Beatles do, etc. The inspiration comes largely from projects like COCA[2].

I'm wondering if any of you have opinions about which database to use (Mongo is most likely just a stopgap), and how best to architect it. I'm most familiar with MySQL and Mongo, but I'd rather not be limited to those two if there's a better option out there.

I'm thinking that I'll probably end up storing tokens rather than types: each occurrence of a word would be stored individually, as opposed to having a single entry for, e.g., "the" that stores every instance of the word "the". I was also thinking that I'll probably end up having to store each token's "previous" and "next", either as full references or just as strings. This seems potentially inefficient, however.

(I could've just gone to StackOverflow with this, but figured I'm more likely to get a real answer here, because you all seem so smart and nice?)

Thanks!

[1] https://en.wikipedia.org/wiki/N-gram
[2] http://corpus.byu.edu/coca/
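For concreteness, here is a minimal sketch of how the nested, integer-keyed map shown earlier could be built, and how n-grams and a token's neighbours could be read back out of it. The function names (index-map, song->lines, line-ngrams, neighbours) are hypothetical, and the tokenization (lower-casing, splitting on whitespace, no punctuation handling) is an assumption for illustration only, not a description of any existing code.

```clojure
(ns lyrics.sketch
  (:require [clojure.string :as str]))

(defn index-map
  "Turns a sequence into a map keyed by position: [a b] -> {0 a, 1 b}."
  [xs]
  (into {} (map-indexed vector xs)))

(defn song->lines
  "Splits raw lyrics into non-blank lines, lower-cases them, splits each line
  on whitespace, and indexes both levels by position.
  Assumption: no punctuation stripping or other normalization."
  [lyrics]
  (->> (str/split-lines lyrics)
       (map str/trim)
       (remove str/blank?)
       (map #(str/split (str/lower-case %) #"\s+"))
       (map index-map)
       index-map))

(defn line-ngrams
  "Returns the n-grams of one indexed line as vectors of words."
  [n line]
  ;; Restore word order from the integer keys, then slide a window of size n.
  (let [words (map line (range (count line)))]
    (map vec (partition n 1 words))))

(defn neighbours
  "Looks up a token's previous and next word by position alone.
  Positions outside the line yield nil."
  [line i]
  {:word (line i) :previous (line (dec i)) :next (line (inc i))})

(comment
  (def song (song->lines "Go tell Aunt Rhodie\nThe old grey goose is dead"))
  (line-ngrams 2 (song 0))
  ;; => (["go" "tell"] ["tell" "aunt"] ["aunt" "rhodie"])
  (neighbours (song 0) 2)
  ;; => {:word "aunt", :previous "tell", :next "rhodie"}
  )
```

In this sketch the previous and next words fall out of the position keys, which suggests that a store keeping each token's song, line, and position might not need separate "previous" and "next" fields. Whether that holds up in a particular database is a separate question.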
--
You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.