Re: [OT?] Best DB/architecture for n-gram corpus?
OK, I see. Well, on non-trivially sized corpora, I think storage requirements can become an issue, and when you're handling user queries one might wonder how often someone will query a 10-gram. But if you can make it work, go nuts!

For a lot of statistical language modeling there seems to be a sweet spot at the 3-gram point. I feel like I even saw a paper recently that compared different human languages and concluded something about the importance of trigrams, but I can't find it now.

On Tue, Mar 10, 2015 at 10:58 AM, Sam Raker sam.ra...@gmail.com wrote:
> I more meant deciding on a maximum size and storing them qua ngrams--it seems limiting. On the other hand, after a certain size, they stop being ngrams and start being something else--"texts," possibly.

--
You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clo...@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+u...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups Clojure group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
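To put a rough number on the storage concern: the sketch below counts how many n-gram occurrences you would have to index for a text of a given length if you store every n from 1 up to some maximum. It is only an upper bound on the number of distinct keys, it treats the text as a single unbroken token sequence, and `total_ngrams` is a hypothetical helper name, not something from the thread.

```python
def total_ngrams(num_tokens, max_n):
    """Count n-gram occurrences for n = 1..max_n in a sequence of num_tokens tokens.

    A sequence of T tokens contains T - n + 1 n-grams of size n (or none if n > T).
    """
    return sum(max(num_tokens - n + 1, 0) for n in range(1, max_n + 1))


# Compare indexing up to trigrams with indexing up to 10-grams
# for a hypothetical 1000-token text.
up_to_3 = total_ngrams(1000, 3)
up_to_10 = total_ngrams(1000, 10)
print(up_to_3, up_to_10)
```

The count grows roughly linearly in the maximum n, but longer n-grams are also longer keys and are much less likely to repeat, so the index compresses less as n goes up.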
Re: [OT?] Best DB/architecture for n-gram corpus?
That's honestly closer to what I was originally envisioning--I've never really looked into graph DBs before, but I'll check out Neo4j tonight. Do you know whether you can model multiple edges between the same nodes?

I'd love to be able to have POS-based wildcarding as a feature, so you could search for e.g. "the ADJ goose", but that's a whole other layer of stuff, so it might go in the "eventually, maybe" pile.

On Tuesday, March 10, 2015 at 3:47:37 PM UTC-4, Ray Miller wrote:
> On 10 March 2015 at 17:58, Sam Raker sam@gmail.com wrote:
> > I more meant deciding on a maximum size and storing them qua ngrams--it seems limiting. On the other hand, after a certain size, they stop being ngrams and start being something else--"texts," possibly.
>
> Exactly. When I first read your post, I almost suggested you model this in a graph database like Neo4j or Titan. Each word would be a node in the graph with an edge linking it to the next word in the sentence. You could define an index on the words (so retrieving all nodes for a given word would be fast), then follow edges to find and count particular n-grams.
>
> This is more complicated than the relational model I proposed, and will be a bit slower to query. But if you don't want to put an upper-bound on the length of the n-gram when you index the data, it might be the way to go.
>
> Ray.
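A minimal sketch of what POS-based wildcarding could look like once tokens carry part-of-speech tags. Everything here is an assumption for illustration: the tags are assigned by hand rather than by a tagger, `match_pattern` is a hypothetical name, and the convention that an all-uppercase pattern item means "match the tag" only works if the words themselves are lowercased.

```python
def match_pattern(tagged, pattern):
    """Return every word sequence in `tagged` that matches `pattern`.

    `tagged` is a list of (word, tag) pairs. A pattern item written in
    uppercase (e.g. "ADJ") matches any token with that tag; any other
    item must equal the token's word exactly.
    """
    hits = []
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(
            (item == tag) if item.isupper() else (item == word)
            for item, (word, tag) in zip(pattern, window)
        ):
            hits.append([word for word, _ in window])
    return hits


# Hand-tagged example line (the tags are made up for the example).
line = [("the", "DET"), ("grey", "ADJ"), ("goose", "NOUN"),
        ("is", "VERB"), ("dead", "ADJ")]
print(match_pattern(line, ["the", "ADJ", "goose"]))
```

This scans one line linearly; in a real corpus the tag would more likely be stored alongside each token so the database can do the filtering.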
Re: [OT?] Best DB/architecture for n-gram corpus?
POS tagging is a solved-enough problem, at least in most domains. Clojure still doesn't have its own NLTK (HINT HINT, GSOC kids!), but I'm sure I can find a Java lib or 2 that should do the job well enough.

On Tuesday, March 10, 2015 at 4:27:17 PM UTC-4, Ray Miller wrote:
> On 10 March 2015 at 20:03, Sam Raker sam@gmail.com wrote:
> > That's honestly closer to what I was originally envisioning--I've never really looked into graph dbs before, but I'll check out Neo4j tonight. Do you know whether you can model multiple edges between the same nodes?
>
> Yes, certainly possible. If you go for Neo4j you have two options for Clojure: embedded (with the borneo library and reading Javadoc, or plain Java interop) or stand-alone server with REST API (with the well-documented Neocons library from Clojurewerkz).
>
> You'll also have to think about how to model which text (song) each phrase came from - likely another node type in the graph with a linking edge to the start of the phrase.
>
> Great book on Neo4j available for free download, also covers data modelling: http://neo4j.com/books/
>
> > I'd love to be able to have POS-based wildcarding as a feature, so you could search for e.g. "the ADJ goose", but that's a whole other layer of stuff, so it might go in the "eventually, maybe" pile.
>
> Sounds like fun, but means doing some natural language processing on the input texts, which is a much more difficult problem than simply tokenizing.
>
> Ray.
Re: [OT?] Best DB/architecture for n-gram corpus?
I've looked at the full-text search (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) that comes with Postgres 9.4, and it's really, really powerful and fast.

On 03/06/2015 09:25 PM, Sam Raker wrote:
> I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking individual songs into lines, which are then split into words, so you end up with something like
>
> {0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"} 1 {0 "the" 1 "old" 2 "grey" 3 "goose" 4 "is" 5 "dead"}...}
>
> (Yes, maps with integer keys are kind of dumb; I thought about using vectors, but this is all going into MongoDB temporarily, and I'd rather just deal with maps instead of messing with Mongo's somewhat lacking array-handling stuff.)
>
> The idea, ultimately, is to build a front-end that would allow users to, e.g., search for all songs that contain the (sub)string "aunt rhodie", or see how many times The Rolling Stones use the word "woman" vs how many times the Beatles do, etc. The inspiration comes largely from projects like COCA[2].
>
> I'm wondering if any of you have opinions about which database to use (Mongo is most likely just a stopgap), and how best to architect it. I'm most familiar with MySQL and Mongo, but I'd rather not be limited by just those two if there's a better option out there.
>
> I'm thinking that I'll probably end up storing tokens over types--e.g., each word would be stored individually, as opposed to having an entry for, e.g., "the" that stores every instance of the word "the". I was also thinking that I'll probably have to end up storing each token's previous and next, either as full references or just as strings. This seems potentially inefficient, however.
>
> (I could've just gone to StackOverflow with this, but figured I'm more likely to get a real answer here, because you all seem so smart and nice?)
>
> Thanks!
>
> [1] https://en.wikipedia.org/wiki/N-gram
> [2] http://corpus.byu.edu/coca/
Re: [OT?] Best DB/architecture for n-gram corpus?
By "hard coding" n-grams, do you mean using the simple string representation, e.g. "aunt rhodie", as the key in your database? If so, then maybe it helps to think of it from the perspective that it's not really just text: it's a string that encodes an n-gram, just like ["aunt", "rhodie"] is another way to encode an n-gram. The encoding/decoding uses clojure.string/join and clojure.string/split instead of json/write and json/read, and escaping tokens that contain spaces is on your TODO list at a low priority :)

(And I think the Google n-gram corpus https://catalog.ldc.upenn.edu/LDC2006T13 uses the same format.)

John

On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker sam.ra...@gmail.com wrote:
> That's interesting. I've been really reluctant to "hard code" n-grams, but it's probably the best way to go.
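A minimal sketch of the "a string key is just an encoded n-gram" idea, written in Python with `str.join` and `str.split` standing in for clojure.string/join and clojure.string/split. The function names are hypothetical, and refusing tokens that contain the separator is just one way to put off the escaping problem mentioned above.

```python
def encode_ngram(tokens, sep=" "):
    """Encode a list of tokens as a single string key."""
    for token in tokens:
        # Escaping is deliberately not handled; fail loudly instead of
        # producing a key that would decode to the wrong n-gram.
        if sep in token:
            raise ValueError(f"token {token!r} contains the separator {sep!r}")
    return sep.join(tokens)


def decode_ngram(key, sep=" "):
    """Decode a string key back into its list of tokens."""
    return key.split(sep)


key = encode_ngram(["aunt", "rhodie"])
print(key)
print(decode_ngram(key))
```

As long as no token contains the separator, encoding and decoding round-trip exactly, which is all a key-value store needs from the key format.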
Re: [OT?] Best DB/architecture for n-gram corpus?
I more meant deciding on a maximum size and storing them qua ngrams--it seems limiting. On the other hand, after a certain size, they stop being ngrams and start being something else--"texts," possibly.

On Tuesday, March 10, 2015 at 1:29:44 PM UTC-4, John Wiseman wrote:
> By "hard coding" n-grams, do you mean using the simple string representation, e.g. "aunt rhodie", as the key in your database? If so, then maybe it helps to think of it from the perspective that it's not really just text, it's a string that encodes an n-gram just like ["aunt", "rhodie"] is another way to encode an n-gram--the encoding/decoding uses clojure.string/join and clojure.string/split instead of json/write and json/read, and escaping tokens that contain spaces is on your TODO list at a low priority :)
>
> (And I think the Google n-gram corpus https://catalog.ldc.upenn.edu/LDC2006T13 uses the same format.)
>
> John
Re: [OT?] Best DB/architecture for n-gram corpus?
On 10 March 2015 at 20:03, Sam Raker sam.ra...@gmail.com wrote:
> That's honestly closer to what I was originally envisioning--I've never really looked into graph dbs before, but I'll check out Neo4j tonight. Do you know whether you can model multiple edges between the same nodes?

Yes, certainly possible. If you go for Neo4j you have two options for Clojure: embedded (with the borneo library and reading Javadoc, or plain Java interop) or stand-alone server with REST API (with the well-documented Neocons library from Clojurewerkz).

You'll also have to think about how to model which text (song) each phrase came from - likely another node type in the graph with a linking edge to the start of the phrase.

Great book on Neo4j available for free download, also covers data modelling: http://neo4j.com/books/

> I'd love to be able to have POS-based wildcarding as a feature, so you could search for e.g. "the ADJ goose", but that's a whole other layer of stuff, so it might go in the "eventually, maybe" pile.

Sounds like fun, but means doing some natural language processing on the input texts, which is a much more difficult problem than simply tokenizing.

Ray.
Re: [OT?] Best DB/architecture for n-gram corpus?
On 10 March 2015 at 17:58, Sam Raker sam.ra...@gmail.com wrote:
> I more meant deciding on a maximum size and storing them qua ngrams--it seems limiting. On the other hand, after a certain size, they stop being ngrams and start being something else--"texts," possibly.

Exactly. When I first read your post, I almost suggested you model this in a graph database like Neo4j or Titan. Each word would be a node in the graph with an edge linking it to the next word in the sentence. You could define an index on the words (so retrieving all nodes for a given word would be fast), then follow edges to find and count particular n-grams.

This is more complicated than the relational model I proposed, and will be a bit slower to query. But if you don't want to put an upper-bound on the length of the n-gram when you index the data, it might be the way to go.

Ray.
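The graph model described above can be sketched in memory without a graph database. This is only an illustration under stated assumptions: plain Python dicts stand in for Neo4j or Titan, every word occurrence is its own node, and the class and method names (`WordGraph`, `add_sentence`, `count`) are hypothetical.

```python
from collections import defaultdict


class WordGraph:
    """Each word occurrence is a node with an edge to the next word in its sentence."""

    def __init__(self):
        self.words = []                     # node id -> word
        self.next_node = {}                 # node id -> id of the following node
        self.by_word = defaultdict(list)    # word -> node ids (the "index on the words")

    def add_sentence(self, tokens):
        prev = None
        for token in tokens:
            node_id = len(self.words)
            self.words.append(token)
            self.by_word[token].append(node_id)
            if prev is not None:
                self.next_node[prev] = node_id
            prev = node_id

    def count(self, phrase):
        """Count occurrences of `phrase` (a list of words of any length)."""
        if not phrase:
            return 0
        total = 0
        # Start from every node for the first word, then follow edges.
        for node_id in self.by_word.get(phrase[0], []):
            current = node_id
            for word in phrase[1:]:
                current = self.next_node.get(current)
                if current is None or self.words[current] != word:
                    break
            else:
                total += 1
        return total


graph = WordGraph()
graph.add_sentence("the old grey goose is dead".split())
graph.add_sentence("the old grey mare".split())
print(graph.count(["the", "old", "grey"]))
```

Nothing here fixes a maximum n-gram length at indexing time; the cost moves to query time, where each candidate start node has to be walked edge by edge.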
Re: [OT?] Best DB/architecture for n-gram corpus?
One thing you can do is index 1, 2, 3...n-grams and use a simple, fast key-value store (like leveldb etc.). E.g., you could have entries like

aunt rhodie - song-9, song-44
woman - song-12, song-65, song-96

That's basically how I made the Metafilter N-gram Viewer http://mefingram.appspot.com/, a clone of Google Books Ngram Viewer https://books.google.com/ngrams.

Another possibility is using Lucene. Just be aware that Lucene calls n-grams of characters ("au", "un", "nt") n-grams, but it calls n-grams of words ("that the", "the old", "old gray") shingles. So you would end up using (I think, I haven't done this) the ShingleFilter https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html.

You might also find this article by Russ Cox interesting, where he describes building and using an inverted trigram index: http://swtch.com/~rsc/regexp/regexp4.html

John

On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks phill.w...@gmail.com wrote:
> A lot of guys would use Lucene. Lucene calls n-grams of words "shingles". [1] As for architecture, here is a suggestion to use Lucene to find keys to records in your real database. [2]
>
> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
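A minimal sketch of the "index every 1..n-gram into a key-value store" approach. A plain Python dict stands in for leveldb, the song ids and lyrics are made up for the example, and `ngrams` and `build_index` are hypothetical helper names.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return all n-grams of size n as tuples of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(songs, max_n=3):
    """Map each space-joined n-gram (n = 1..max_n) to the set of song ids containing it."""
    index = defaultdict(set)
    for song_id, lyrics in songs.items():
        tokens = lyrics.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                index[" ".join(gram)].add(song_id)
    return index


# Made-up data for illustration only.
songs = {
    "song-9": "go tell aunt rhodie",
    "song-44": "aunt rhodie the old grey goose is dead",
    "song-12": "woman",
}
index = build_index(songs)
print(sorted(index["aunt rhodie"]))
```

A lookup is then a single key fetch, at the price of deciding `max_n` up front and storing every n-gram up to that size.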
Re: [OT?] Best DB/architecture for n-gram corpus?
That's interesting. I've been really reluctant to "hard code" n-grams, but it's probably the best way to go.

On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:
> One thing you can do is index 1, 2, 3...n-grams and use a simple fast key-value store (like leveldb etc.) e.g., you could have entries like
>
> aunt rhodie - song-9, song-44
> woman - song-12, song-65, song-96
>
> That's basically how I made the Metafilter N-gram Viewer http://mefingram.appspot.com/, a clone of Google Books Ngram Viewer https://books.google.com/ngrams.
>
> Another possibility is using Lucene. Just be aware that Lucene calls n-grams of characters ("au", "un", "nt") n-grams but it calls n-grams of words ("that the", "the old", "old gray") shingles. So you would end up using (I think, I haven't done this) the ShingleFilter https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html.
>
> You might also find this article by Russ Cox interesting, where he describes building and using an inverted trigram index: http://swtch.com/~rsc/regexp/regexp4.html
>
> John
Re: [OT?] Best DB/architecture for n-gram corpus?
On 7 March 2015 at 00:25, Sam Raker sam.ra...@gmail.com wrote: I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking individual songs into lines, which are then split into words, so you end up with something like {0 {0 go 1 tell 2 aunt 3 rhodie} 1 {0 the 1 old 2 grey 3 goose 4 is 5 dead}...} Why split into lines? In this example, rhodie the is just as valid a bigram as tell aunt. It would be more natural to split at a sentence boundary. (Yes, maps with integer keys is kind of dumb; I thought about using vectors, but this is all going into MongoDB temporarily, and I'd rather just deal with maps instead of messing with Mongo's somewhat lacking array-handling stuff.) The idea, ultimately, is to build a front-end that would allow users to, e.g., search for all songs that contain the (sub)string aunt rhodie, or see how many times The Rolling Stones use the word woman vs how many times the Beatles do, etc. The inspiration comes largely from projects like COCA[2]. I'm wondering if any of you have opinions about which database to use (Mongo is most likely just a stopgap), and how best to architect it. I'm most familiar with MySQL and Mongo, but I'd rather not be limited by just those two if there's a better option out there. First up, I think you'll likely want to trade space for speed, and the simplest way to do this is to store every n-gram you're interested in. This means deciding up-front the maximum size of the n-gram you're interested in. You could fairly easily model this in any relational database as: tracks id (serial, primary key) name (text) n_grams id (serial, primary key) n (integer) n_gram (text) track_n_gram --- track (references tracks.id) n_gram (references n_grams.id) num_occurrences (integer) With that schema, you should be able to answer all of the queries you posed with some simple SQL. 
Of course, this being the Clojure mailing list you should also consider Datomic, and I think the above could be mapped to a Datomic schema without too much effort.

Ray.
Re: [OT?] Best DB/architecture for n-gram corpus?
A lot of guys would use Lucene. Lucene calls n-grams of words shingles. [1] As for architecture, here is a suggestion to use Lucene to find keys to records in your real database. [2]

[1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
[2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
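
To make the "index finds keys, real database holds records" idea concrete, here is a rough sketch in plain Python. It is not Lucene's API: a dict of sets stands in for the search index, another dict stands in for the real database, and the function names and the lyric attached to song-44 are invented for illustration.

```python
from collections import defaultdict


def shingles(tokens, n):
    """Word n-grams, i.e. what Lucene calls shingles."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(records, n=2):
    """Map each shingle to the set of record keys that contain it."""
    index = defaultdict(set)
    for key, text in records.items():
        for shingle in shingles(text.lower().split(), n):
            index[shingle].add(key)
    return index


# Stand-in for the "real database": key -> full record (hypothetical data).
records = {
    "song-9": "go tell aunt rhodie",
    "song-44": "aunt rhodie the old grey goose is dead",
}

index = build_index(records)

# Step 1: ask the index for keys. Step 2: fetch the records by key.
keys = sorted(index.get("aunt rhodie", set()))
print(keys)                        # ['song-44', 'song-9']
print([records[k] for k in keys])
```

The index only ever stores keys, so it can be rebuilt or swapped for a real search engine without touching the records themselves.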
[OT?] Best DB/architecture for n-gram corpus?
I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking individual songs into lines, which are then split into words, so you end up with something like

{0 {0 go 1 tell 2 aunt 3 rhodie} 1 {0 the 1 old 2 grey 3 goose 4 is 5 dead}...}

(Yes, maps with integer keys is kind of dumb; I thought about using vectors, but this is all going into MongoDB temporarily, and I'd rather just deal with maps instead of messing with Mongo's somewhat lacking array-handling stuff.)

The idea, ultimately, is to build a front-end that would allow users to, e.g., search for all songs that contain the (sub)string "aunt rhodie", or see how many times The Rolling Stones use the word "woman" vs how many times the Beatles do, etc. The inspiration comes largely from projects like COCA[2].

I'm wondering if any of you have opinions about which database to use (Mongo is most likely just a stopgap), and how best to architect it. I'm most familiar with MySQL and Mongo, but I'd rather not be limited by just those two if there's a better option out there.

I'm thinking that I'll probably end up storing tokens over types--e.g., each word would be stored individually, as opposed to having an entry for, e.g., "the" that stores every instance of the word "the". I was also thinking that I'll probably have to end up storing each token's previous and next, either as full references or just as strings. This seems potentially inefficient, however.

(I could've just gone to StackOverflow with this, but figured I'm more likely to get a real answer here, because you all seem so smart and nice?)

Thanks!

[1] https://en.wikipedia.org/wiki/N-gram
[2] http://corpus.byu.edu/coca/
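
For reference, a small sketch of the token-per-record idea from the post, assuming the nested {line index: {word index: word}} shape shown there. The function names and the artist labels are hypothetical, and both "catalogues" simply reuse the example song rather than real lyrics.

```python
from collections import Counter

# The nested {line index: {word index: word}} shape from the post.
song = {
    0: {0: "go", 1: "tell", 2: "aunt", 3: "rhodie"},
    1: {0: "the", 1: "old", 2: "grey", 3: "goose", 4: "is", 5: "dead"},
}


def flatten(song):
    """Return the words of a song in order, line by line."""
    return [line[i] for _, line in sorted(song.items()) for i in sorted(line)]


def token_records(words):
    """One record per token, each carrying its previous and next word."""
    return [
        {"word": w,
         "prev": words[i - 1] if i > 0 else None,
         "next": words[i + 1] if i + 1 < len(words) else None}
        for i, w in enumerate(words)
    ]


def word_counts_by_artist(songs_by_artist):
    """Map artist -> Counter of word frequencies across their songs."""
    return {artist: Counter(w for s in songs for w in flatten(s))
            for artist, songs in songs_by_artist.items()}


words = flatten(song)
records = token_records(words)
print(records[3])  # {'word': 'rhodie', 'prev': 'aunt', 'next': 'the'}

# Hypothetical artists; both reuse the example song.
counts = word_counts_by_artist({"artist-a": [song, song], "artist-b": [song]})
print(counts["artist-a"]["goose"], counts["artist-b"]["goose"])  # 2 1
```

Storing prev/next as strings triples the text stored per token, which is the inefficiency the post worries about; counting by type, as in word_counts_by_artist, answers the frequency question without any per-token records.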