I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking 
individual songs into lines, which are then split into words, so you end up 
with something like

{0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"}
 1 {0 "the" 1 "old" 2 "grey" 3 "goose" 4 "is" 5 "dead"}
 ...}

(Yes, maps with integer keys are kind of dumb; I thought about using 
vectors, but this is all going into MongoDB temporarily, and I'd rather 
just deal with maps than mess with Mongo's somewhat lacking 
array handling.)
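
For concreteness, here is a minimal sketch of how a structure like the one above could be built. The names index-map and song->map are made up for illustration, and the tokenizing regex (letters and apostrophes, lowercased) is just an assumption about how words should be split:

```clojure
(require '[clojure.string :as str])

(defn index-map
  "Turns a sequence into a map keyed by position: [a b] -> {0 a, 1 b}."
  [xs]
  (into {} (map-indexed vector xs)))

(defn song->map
  "Splits lyrics into lines, then each line into lowercase words.
  Lines without any words are skipped, so line keys stay consecutive."
  [lyrics]
  (index-map
    (for [line  (str/split-lines lyrics)
          ;; assumed tokenizer: runs of letters and apostrophes
          :let  [words (re-seq #"[\p{L}']+" (str/lower-case line))]
          :when (seq words)]
      (index-map words))))

(def example
  (song->map "Go tell Aunt Rhodie\nThe old grey goose is dead"))
```

Swapping index-map for vec would give the vector-based version, if the Mongo constraint ever goes away.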

The idea, ultimately, is to build a front-end that would allow users to, 
e.g., search for all songs that contain the (sub)string "aunt rhodie", or 
see how many times The Rolling Stones use the word "woman" vs how many 
times the Beatles do, etc. The inspiration comes largely from projects like 
COCA[2]. 
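
To make those two kinds of query concrete, here is a rough in-memory sketch. Everything in it is hypothetical: the sample songs (the lines for "Artist A" and "Artist B" are invented), the :artist/:title/:lines keys, and the function names. It uses plain vectors of words per line, and it only finds phrases that sit within a single line:

```clojure
;; Hypothetical sample data; the second and third entries are invented.
(def songs
  [{:artist "Traditional" :title "Go Tell Aunt Rhodie"
    :lines  [["go" "tell" "aunt" "rhodie"]
             ["the" "old" "grey" "goose" "is" "dead"]]}
   {:artist "Artist A" :title "Song One"
    :lines  [["woman" "oh" "woman"]
             ["tell" "me" "why"]]}
   {:artist "Artist B" :title "Song Two"
    :lines  [["hey" "woman"]]}])

(defn contains-phrase?
  "True if the words of phrase occur consecutively within one line of song."
  [song phrase]
  (let [n (count phrase)]
    (boolean
      (some (fn [line]
              ;; slide a window of n words across the line
              (some #(= phrase %) (partition n 1 line)))
            (:lines song)))))

(defn titles-with-phrase
  "Titles of all songs that contain the phrase."
  [songs phrase]
  (mapv :title (filter #(contains-phrase? % phrase) songs)))

(defn word-count-by-artist
  "Map of artist -> number of occurrences of word across that artist's songs."
  [songs word]
  (reduce (fn [acc {:keys [artist lines]}]
            (update acc artist (fnil + 0)
                    (count (filter #(= word %) (apply concat lines)))))
          {}
          songs))
```

With this data, (titles-with-phrase songs ["aunt" "rhodie"]) returns ["Go Tell Aunt Rhodie"]. A real corpus would want an index rather than a linear scan, but the shape of the queries stays the same.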

I'm wondering if any of you have opinions about which database to use 
(Mongo is most likely just a stopgap), and how best to architect it. I'm 
most familiar with MySQL and Mongo, but I'd rather not be limited by just 
those two if there's a better option out there. I'm thinking that I'll 
probably end up storing tokens rather than types, i.e., each occurrence of 
a word would be stored individually, as opposed to having a single entry 
for, e.g., "the" that lists every instance of the word "the". I was also 
thinking that I'll probably have to store each token's "previous" and 
"next", either as full references or just as strings. This seems 
potentially inefficient, however. 
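
One possible alternative to storing "previous" and "next", sketched under the assumption that each token record carries its song, line, and position: the neighbours can then be recovered from the ordering, and n-grams fall out of a sliding window. The names tokens and ngrams and the :song/:line/:pos/:word keys are hypothetical, not from any existing library:

```clojure
(defn tokens
  "Flattens one song into token records, one per word occurrence."
  [song-id lines]
  (for [[line-no words] (map-indexed vector lines)
        [pos word]      (map-indexed vector words)]
    {:song song-id :line line-no :pos pos :word word}))

(defn ngrams
  "All runs of n consecutive words in a single line."
  [n words]
  (partition n 1 words))

(def lines
  [["go" "tell" "aunt" "rhodie"]
   ["the" "old" "grey" "goose" "is" "dead"]])

;; One flat record per token, suitable for a row- or document-per-token store.
(def token-rows (tokens 0 lines))

;; Bigram counts for the whole song, computed per line so that
;; n-grams never span a line break.
(def bigram-counts
  (frequencies (mapcat #(ngrams 2 %) lines)))
```

Here (ngrams 2 (first lines)) yields (("go" "tell") ("tell" "aunt") ("aunt" "rhodie")). Whether n-grams are precomputed and stored or derived on demand like this is a trade-off between storage and query time.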

(I could've just gone to StackOverflow with this, but figured I'm more 
likely to get a real answer here, because you all seem so smart and nice?)


Thanks!



[1] https://en.wikipedia.org/wiki/N-gram
[2] http://corpus.byu.edu/coca/

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en