On 7 March 2015 at 00:25, Sam Raker <sam.ra...@gmail.com> wrote:
> I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking
> individual songs into lines, which are then split into words, so you end up
> with something like
>
> {0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"}
>  1 {0 "the" 1 "old" 2 "grey" 3 "goose" 4 "is" 5 "dead"}...}

Why split into lines? In this example, "rhodie the" is just as valid a
bigram as "tell aunt". It would be more natural to split at a sentence
boundary.
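
To make that concrete, here is a minimal Python sketch (Python only for
brevity; the `ngrams` helper is my own name, not from any library). It
assumes the two example lines belong to one sentence, which is a guess on
my part; a real pipeline would need some rule for finding sentence
boundaries in lyrics.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# The two lines from the example above, already split into words.
lines = [
    ["go", "tell", "aunt", "rhodie"],
    ["the", "old", "grey", "goose", "is", "dead"],
]

# Bigrams computed per line never span a line break.
per_line = [gram for line in lines for gram in ngrams(line, 2)]

# Assumption: both lines form a single sentence, so we flatten them
# into one token sequence before taking bigrams.
sentence = [word for line in lines for word in line]
per_sentence = ngrams(sentence, 2)

# The only difference between the two results is the bigram that
# crosses the line break.
only_in_sentence = [gram for gram in per_sentence if gram not in per_line]
```

Here `only_in_sentence` holds exactly the bigram that the per-line split
loses.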

> (Yes, maps with integer keys are kind of dumb; I thought about using vectors,
> but this is all going into MongoDB temporarily, and I'd rather just deal
> with maps instead of messing with Mongo's somewhat lacking array-handling
> stuff.)
>
> The idea, ultimately, is to build a front-end that would allow users to,
> e.g., search for all songs that contain the (sub)string "aunt rhodie", or
> see how many times The Rolling Stones use the word "woman" vs how many times
> the Beatles do, etc. The inspiration comes largely from projects like
> COCA[2].
>
> I'm wondering if any of you have opinions about which database to use (Mongo
> is most likely just a stopgap), and how best to architect it. I'm most
> familiar with MySQL and Mongo, but I'd rather not be limited by just those
> two if there's a better option out there.

First up, I think you'll likely want to trade space for speed, and the
simplest way to do this is to precompute and store every n-gram you
might want to query. This means deciding up front on the maximum
n-gram size you want to support. You could fairly easily model this
in any relational database as:

tracks
--------
id (serial, primary key)
name (text)

n_grams
------------
id (serial, primary key)
n (integer)
n_gram (text)

track_n_gram
-------------------
track (references tracks.id)
n_gram (references n_grams.id)
num_occurrences (integer)

With that schema, you should be able to answer all of the queries you
posed with some simple SQL.
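
As a rough sketch of how that might look in practice, here is the schema
in SQLite via Python's standard `sqlite3` module. Several things here are
my assumptions rather than part of the design above: `MAX_N = 3`, the
`add_track` and `tracks_containing` helpers, the uniqueness constraints,
and the made-up track names. SQLite has no `serial` type, so
`INTEGER PRIMARY KEY` stands in for it, and `INSERT OR IGNORE` is
SQLite-specific syntax.

```python
import sqlite3
from collections import Counter

MAX_N = 3  # assumed maximum n-gram size, decided up front


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tracks (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE n_grams (
        id     INTEGER PRIMARY KEY,
        n      INTEGER NOT NULL,
        n_gram TEXT NOT NULL UNIQUE
    );
    CREATE TABLE track_n_gram (
        track           INTEGER NOT NULL REFERENCES tracks(id),
        n_gram          INTEGER NOT NULL REFERENCES n_grams(id),
        num_occurrences INTEGER NOT NULL,
        PRIMARY KEY (track, n_gram)
    );
""")


def add_track(name, tokens):
    """Store a track and the counts of all its n-grams up to MAX_N."""
    track_id = conn.execute(
        "INSERT INTO tracks (name) VALUES (?)", (name,)
    ).lastrowid
    for n in range(1, MAX_N + 1):
        for gram, count in Counter(ngrams(tokens, n)).items():
            # Reuse the n_grams row if this n-gram was seen before.
            conn.execute(
                "INSERT OR IGNORE INTO n_grams (n, n_gram) VALUES (?, ?)",
                (n, gram),
            )
            (gram_id,) = conn.execute(
                "SELECT id FROM n_grams WHERE n_gram = ?", (gram,)
            ).fetchone()
            conn.execute(
                "INSERT INTO track_n_gram VALUES (?, ?, ?)",
                (track_id, gram_id, count),
            )
    return track_id


def tracks_containing(phrase):
    """Return (track name, occurrences) for tracks containing the phrase."""
    return conn.execute(
        """
        SELECT t.name, tg.num_occurrences
        FROM tracks t
        JOIN track_n_gram tg ON tg.track = t.id
        JOIN n_grams g ON g.id = tg.n_gram
        WHERE g.n_gram = ?
        ORDER BY t.name
        """,
        (phrase,),
    ).fetchall()


# Made-up sample data, purely for illustration.
add_track("track A", "go tell aunt rhodie the old grey goose is dead".split())
add_track("track B", "the old grey mare the old grey mare".split())
```

With this loaded, `tracks_containing("aunt rhodie")` finds only the first
track, while `tracks_containing("the old grey")` finds both, with a count
of 2 for the second. Note that a lookup like this only works for phrases
of at most `MAX_N` words, and comparing artists would need an artist
column (or table) that the schema above does not have.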

Of course, this being the Clojure mailing list, you should also
consider Datomic; I think the above could be mapped to a Datomic
schema without too much effort.

Ray.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en