On 7 March 2015 at 00:25, Sam Raker <sam.ra...@gmail.com> wrote:
> I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking
> individual songs into lines, which are then split into words, so you end up
> with something like
>
> {0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"} 1 {0 "the" 1 "old" 2 "grey" 3
> "goose" 4 "is" 5 "dead"}...}

Why split into lines? In this example, "rhodie the" is just as valid a bigram
as "tell aunt". It would be more natural to split at a sentence boundary.

> (Yes, maps with integer keys is kind of dumb; I thought about using vectors,
> but this is all going into MongoDB temporarily, and I'd rather just deal
> with maps instead of messing with Mongo's somewhat lacking array-handling
> stuff.)
>
> The idea, ultimately, is to build a front-end that would allow users to,
> e.g., search for all songs that contain the (sub)string "aunt rhodie", or
> see how many times The Rolling Stones use the word "woman" vs how many times
> the Beatles do, etc. The inspiration comes largely from projects like
> COCA[2].
>
> I'm wondering if any of you have opinions about which database to use (Mongo
> is most likely just a stopgap), and how best to architect it. I'm most
> familiar with MySQL and Mongo, but I'd rather not be limited by just those
> two if there's a better option out there.

First up, I think you'll likely want to trade space for speed, and the
simplest way to do this is to store every n-gram you're interested in. This
means deciding up-front the maximum size of the n-gram you're interested in.

You could fairly easily model this in any relational database as:

tracks
--------
id (serial, primary key)
name (text)

n_grams
------------
id (serial, primary key)
n (integer)
n_gram (text)

track_n_gram
-------------------
track (references tracks.id)
n_gram (references n_grams.id)
num_occurrences (integer)

With that schema, you should be able to answer all of the queries you posed
with some simple SQL.

Of course, this being the Clojure mailing list you should also consider
Datomic, and I think the above could be mapped to a Datomic schema without too
much effort.

Ray.
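
For what it's worth, here is a rough sketch of how the three-table schema above might look in practice. It is only an illustration under some assumptions of mine: it uses SQLite through Python's built-in sqlite3 module purely because that is self-contained, the maximum n-gram size (MAX_N = 3) is an arbitrary choice, and the helper names (n_grams, create_schema, add_track, tracks_containing) are made up for the example. The UNIQUE constraint and the composite primary key are also my additions, and INTEGER PRIMARY KEY stands in for "serial". The lyrics are treated as one flat word sequence, so bigrams that span a line break (like "rhodie the") are kept.

```python
import sqlite3
from collections import Counter

MAX_N = 3  # assumed maximum n-gram size, decided up front


def n_grams(words, max_n=MAX_N):
    """Yield (n, text) for every n-gram of 1..max_n consecutive words."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield n, " ".join(words[i:i + n])


def create_schema(conn):
    # Table and column names follow the schema in the post; the constraints
    # beyond the primary/foreign keys are assumptions for this sketch.
    conn.executescript("""
        CREATE TABLE tracks (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE n_grams (
            id INTEGER PRIMARY KEY,
            n INTEGER NOT NULL,
            n_gram TEXT NOT NULL UNIQUE
        );
        CREATE TABLE track_n_gram (
            track INTEGER NOT NULL REFERENCES tracks(id),
            n_gram INTEGER NOT NULL REFERENCES n_grams(id),
            num_occurrences INTEGER NOT NULL,
            PRIMARY KEY (track, n_gram)
        );
    """)


def add_track(conn, name, words):
    """Store a track and the occurrence count of each of its n-grams."""
    track_id = conn.execute(
        "INSERT INTO tracks (name) VALUES (?)", (name,)
    ).lastrowid
    for (n, text), count in Counter(n_grams(words)).items():
        # Reuse an existing n_grams row if another track already added it.
        conn.execute(
            "INSERT OR IGNORE INTO n_grams (n, n_gram) VALUES (?, ?)", (n, text)
        )
        (gram_id,) = conn.execute(
            "SELECT id FROM n_grams WHERE n_gram = ?", (text,)
        ).fetchone()
        conn.execute(
            "INSERT INTO track_n_gram VALUES (?, ?, ?)", (track_id, gram_id, count)
        )
    return track_id


def tracks_containing(conn, phrase):
    """Return (track name, occurrences) for tracks containing the phrase."""
    rows = conn.execute("""
        SELECT t.name, tn.num_occurrences
        FROM tracks t
        JOIN track_n_gram tn ON tn.track = t.id
        JOIN n_grams g ON g.id = tn.n_gram
        WHERE g.n_gram = ?
        ORDER BY t.name
    """, (phrase,))
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
lyrics = "go tell aunt rhodie the old grey goose is dead".split()
add_track(conn, "example track", lyrics)
print(tracks_containing(conn, "aunt rhodie"))  # [('example track', 1)]
print(tracks_containing(conn, "rhodie the"))   # [('example track', 1)]
```

Note that exact-phrase lookups like this only work for phrases of up to MAX_N words, which is the space-for-speed trade-off mentioned above: anything longer is simply not stored.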

--
You received this message because you are subscribed to the Google Groups "Clojure" group.