Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-12 Thread John Wiseman
OK, I see.  Well, on non-trivially sized corpora, I think storage
requirements can become an issue, and in a situation where you're handling
user queries one might wonder how often someone will query a 10-gram.  But
if you can make it work, go nuts!

For a lot of statistical language modeling there seems to be a sweet spot
at the 3-gram point.  I feel like I even saw a paper recently that compared
different human languages and concluded something about the importance of
trigrams, but I can't find it now.


On Tue, Mar 10, 2015 at 10:58 AM, Sam Raker sam.ra...@gmail.com wrote:

 I more meant deciding on a maximum size and storing them qua ngrams--it
 seems limiting. On the other hand, after a certain size, they stop being
 ngrams and start being something else--texts, possibly.

 On Tuesday, March 10, 2015 at 1:29:44 PM UTC-4, John Wiseman wrote:

 By hard coding n-grams, do you mean using the simple string
 representation, e.g. aunt rhodie as the key in your database?  If so,
 then maybe it helps to think of it from the perspective that it's not
 really just text, it's a string that encodes an n-gram just like
 [\aunt\, \rhodie\] is another way to encode an n-gram--the
 encoding/decoding uses clojure.string/join and clojure.string/split instead
 of json/write and json/read, and escaping tokens that contain spaces is on
 your TODO list at a low priority :)

 (And I think the Google n-gram corpus
 https://catalog.ldc.upenn.edu/LDC2006T13 uses the same format.)


 John


 On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker sam@gmail.com wrote:

 That's interesting. I've been really reluctant to hard code n-grams,
 but it's probably the best way to go.

 On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:

 One thing you can do is index 1, 2, 3...n-grams and use a simple  fast
 key-value store (like leveldb etc.)  e.g., you could have entries like

 aunt rhodie - song-9, song-44
 woman - song-12, song-65, song-96


 That's basically how I made the Metafilter N-gram Viewer
 http://mefingram.appspot.com/, a clone of Google Books Ngram Viewer
 https://books.google.com/ngrams.

 Another possibility is using Lucene.  Just be aware that Lucene calls
 n-grams of characters (au, un, nt) n-grams but it calls n-grams of
 words (that the, the old, old gray) shingles.  So you would end up
 using (I think, I haven't done this) the ShingleFilter
 https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
 .

 You might also find this article by Russ Cox interesting, where he
 describes building and using an inverted trigram index:
 http://swtch.com/~rsc/regexp/regexp4.html


 John





 Three things that you might find interesting:

 Russ Cox' explanation of doing indexing and retrieval with an inverted
 trigram index: http://swtch.com/~rsc/regexp/regexp4.html


 On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks phill...@gmail.com
 wrote:

 A lot of guys would use Lucene.  Lucene calls n-grams of words
 shingles. [1]

 As for architecture, here is a suggestion to use Lucene to find keys
 to records in your real database. [2]

 [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/

 [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ


  --
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clo...@googlegroups.com
 Note that posts from new members are moderated - please be patient
 with your first post.
 To unsubscribe from this group, send email to
 clojure+u...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 ---
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to clojure+u...@googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.


  --
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clo...@googlegroups.com
 Note that posts from new members are moderated - please be patient with
 your first post.
 To unsubscribe from this group, send email to
 clojure+u...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 ---
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to clojure+u...@googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.


  --
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clojure@googlegroups.com
 Note that posts from new members are moderated - please be patient with
 your first post.
 To unsubscribe from this group, send email to
 clojure+unsubscr...@googlegroups.com
 For more options, visit this 

Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-12 Thread Sam Raker
That's honestly closer to what I was originally envisioning--I've never 
really looked into graph dbs before, but I'll check out Neo4j tonight. Do 
you know whether you can model multiple edges between the same nodes? I'd 
love to be able to have POS-based wildcarding as a feature, so you could 
search for e.g. the ADJ goose, but that's a whole other layer of stuff, 
so it might go in the eventually, maybe pile.



On Tuesday, March 10, 2015 at 3:47:37 PM UTC-4, Ray Miller wrote:

 On 10 March 2015 at 17:58, Sam Raker sam@gmail.com javascript: 
 wrote: 
  I more meant deciding on a maximum size and storing them qua ngrams--it 
  seems limiting. On the other hand, after a certain size, they stop being 
  ngrams and start being something else--texts, possibly. 

 Exactly. When I first read your post, I almost suggested you model 
 this in a graph database like Neo4j or Titan. Each word would be a 
 node in the graph with an edge linking it to the next word in the 
 sentence. You could define an index on the words (so retrieving all 
 nodes for a given word would be fast), then follow edges to find and 
 count particular n-grams. This is more complicated than the relational 
 model I proposed, and will be a bit slower to query. But if you don't 
 want to put an upper-bound on the length of the n-gram when you index 
 the data, it might be the way to go. 

 Ray. 


-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-12 Thread Sam Raker
POS tagging is a solved-enough problem, at least in most domains. Clojure 
still doesn't have its own NLTK (HINT HINT, GSOC kids!), but I'm sure I can 
find a Java lib or 2 that should do the job well enough.

On Tuesday, March 10, 2015 at 4:27:17 PM UTC-4, Ray Miller wrote:

 On 10 March 2015 at 20:03, Sam Raker sam@gmail.com javascript: 
 wrote: 
  That's honestly closer to what I was originally envisioning--I've never 
  really looked into graph dbs before, but I'll check out Neo4j tonight. 
 Do 
  you know whether you can model multiple edges between the same nodes? 

 Yes, certainly possible. 

 If you go for Neo4j you have two options for Clojure: embedded (with 
 the borneo library and reading Javadoc, or plain Java interop) or 
 stand-alone server with REST API (with the well-documented Neocons 
 library from Clojurewerkz). You'll also have to think about how to 
 model which text (song) each phrase came from - likely another node 
 type in the graph with a linking edge to the start of the phrase. 

 Great book on Neo4j available for free download, also covers data 
 modelling: http://neo4j.com/books/ 

  I'd love to be able to have POS-based wildcarding as a feature, so you 
 could 
  search for e.g. the ADJ goose, but that's a whole other layer of 
 stuff, so 
  it might go in the eventually, maybe pile. 

 Sounds like fun, but means doing some natural language processing on 
 the input texts, which is a much more difficult problem than simply 
 tokenizing. 

 Ray. 


-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-11 Thread Pablo Nussembaum
Have looked at http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ that 
comes with Postgres 9.4 and it's really really powerful and fast.

On 03/06/2015 09:25 PM, Sam Raker wrote:
 I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking 
 individual songs into lines, which are then split into words, so you end up 
 with something like

 {0 {0 go 1 tell 2 aunt 3 rhodie} 1 {0 the 1 old 2 grey 3 
 goose 4 is 5 dead}...}

 (Yes, maps with integer keys is kind of dumb; I thought about using vectors, 
 but this is all going into MongoDB temporarily, and I'd rather just deal with 
 maps instead of messing with Mongo's
 somewhat lacking array-handling stuff.)

 The idea, ultimately, is to build a front-end that would allow users to, 
 e.g., search for all songs that contain the (sub)string aunt rhodie, or see 
 how many times The Rolling Stones use the word
 woman vs how many times the Beatles do, etc. The inspiration comes largely 
 from projects like COCA[2]. 

 I'm wondering if any of you have opinions about which database to use (Mongo 
 is most likely just a stopgap), and how best to architect it. I'm most 
 familiar with MySQL and Mongo, but I'd rather not
 be limited by just those two if there's a better option out there. I'm 
 thinking that I'll probably end up storing tokens over types--e.g., each word 
 would be stored individually, as opposed to
 having an entry for, e.g., the that stores every instance of the word 
 the. I was also thinking that I'll probably have to end up storing each 
 token's previous and next, either as full
 references or just as strings. This seems potentially inefficient, however. 

 (I could've just gone to StackOverflow with this, but figured I'm more likely 
 to get a real answer here, because you all seem so smart and nice?)


 Thanks!



 [1] https://en.wikipedia.org/wiki/N-gram
 [2] http://corpus.byu.edu/coca/
 -- 
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clojure@googlegroups.com
 Note that posts from new members are moderated - please be patient with your 
 first post.
 To unsubscribe from this group, send email to
 clojure+unsubscr...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 ---
 You received this message because you are subscribed to the Google Groups 
 Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send an 
 email to clojure+unsubscr...@googlegroups.com 
 mailto:clojure+unsubscr...@googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-10 Thread John Wiseman
By hard coding n-grams, do you mean using the simple string
representation, e.g. aunt rhodie as the key in your database?  If so,
then maybe it helps to think of it from the perspective that it's not
really just text, it's a string that encodes an n-gram just like
[\aunt\, \rhodie\] is another way to encode an n-gram--the
encoding/decoding uses clojure.string/join and clojure.string/split instead
of json/write and json/read, and escaping tokens that contain spaces is on
your TODO list at a low priority :)

(And I think the Google n-gram corpus
https://catalog.ldc.upenn.edu/LDC2006T13 uses the same format.)


John


On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker sam.ra...@gmail.com wrote:

 That's interesting. I've been really reluctant to hard code n-grams, but
 it's probably the best way to go.

 On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:

 One thing you can do is index 1, 2, 3...n-grams and use a simple  fast
 key-value store (like leveldb etc.)  e.g., you could have entries like

 aunt rhodie - song-9, song-44
 woman - song-12, song-65, song-96


 That's basically how I made the Metafilter N-gram Viewer
 http://mefingram.appspot.com/, a clone of Google Books Ngram Viewer
 https://books.google.com/ngrams.

 Another possibility is using Lucene.  Just be aware that Lucene calls
 n-grams of characters (au, un, nt) n-grams but it calls n-grams of
 words (that the, the old, old gray) shingles.  So you would end up
 using (I think, I haven't done this) the ShingleFilter
 https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
 .

 You might also find this article by Russ Cox interesting, where he
 describes building and using an inverted trigram index:
 http://swtch.com/~rsc/regexp/regexp4.html


 John





 Three things that you might find interesting:

 Russ Cox' explanation of doing indexing and retrieval with an inverted
 trigram index: http://swtch.com/~rsc/regexp/regexp4.html


 On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks phill...@gmail.com
 wrote:

 A lot of guys would use Lucene.  Lucene calls n-grams of words
 shingles. [1]

 As for architecture, here is a suggestion to use Lucene to find keys
 to records in your real database. [2]

 [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/

 [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ


  --
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clo...@googlegroups.com
 Note that posts from new members are moderated - please be patient with
 your first post.
 To unsubscribe from this group, send email to
 clojure+u...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 ---
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to clojure+u...@googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.


  --
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clojure@googlegroups.com
 Note that posts from new members are moderated - please be patient with
 your first post.
 To unsubscribe from this group, send email to
 clojure+unsubscr...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 ---
 You received this message because you are subscribed to the Google Groups
 Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to clojure+unsubscr...@googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.


-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-10 Thread Sam Raker
I more meant deciding on a maximum size and storing them qua ngrams--it 
seems limiting. On the other hand, after a certain size, they stop being 
ngrams and start being something else--texts, possibly.

On Tuesday, March 10, 2015 at 1:29:44 PM UTC-4, John Wiseman wrote:

 By hard coding n-grams, do you mean using the simple string 
 representation, e.g. aunt rhodie as the key in your database?  If so, 
 then maybe it helps to think of it from the perspective that it's not 
 really just text, it's a string that encodes an n-gram just like 
 [\aunt\, \rhodie\] is another way to encode an n-gram--the 
 encoding/decoding uses clojure.string/join and clojure.string/split instead 
 of json/write and json/read, and escaping tokens that contain spaces is on 
 your TODO list at a low priority :)

 (And I think the Google n-gram corpus 
 https://catalog.ldc.upenn.edu/LDC2006T13 uses the same format.)


 John


 On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker sam@gmail.com javascript:
  wrote:

 That's interesting. I've been really reluctant to hard code n-grams, 
 but it's probably the best way to go.

 On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:

 One thing you can do is index 1, 2, 3...n-grams and use a simple  fast 
 key-value store (like leveldb etc.)  e.g., you could have entries like

 aunt rhodie - song-9, song-44
 woman - song-12, song-65, song-96


 That's basically how I made the Metafilter N-gram Viewer 
 http://mefingram.appspot.com/, a clone of Google Books Ngram Viewer 
 https://books.google.com/ngrams.

 Another possibility is using Lucene.  Just be aware that Lucene calls 
 n-grams of characters (au, un, nt) n-grams but it calls n-grams of 
 words (that the, the old, old gray) shingles.  So you would end up 
 using (I think, I haven't done this) the ShingleFilter 
 https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
 .

 You might also find this article by Russ Cox interesting, where he 
 describes building and using an inverted trigram index: 
 http://swtch.com/~rsc/regexp/regexp4.html


 John





 Three things that you might find interesting:

 Russ Cox' explanation of doing indexing and retrieval with an inverted 
 trigram index: http://swtch.com/~rsc/regexp/regexp4.html


 On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks phill...@gmail.com 
 wrote:

 A lot of guys would use Lucene.  Lucene calls n-grams of words 
 shingles. [1]

 As for architecture, here is a suggestion to use Lucene to find keys 
 to records in your real database. [2]

 [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/

 [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ


  -- 
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clo...@googlegroups.com
 Note that posts from new members are moderated - please be patient with 
 your first post.
 To unsubscribe from this group, send email to
 clojure+u...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 --- 
 You received this message because you are subscribed to the Google 
 Groups Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send 
 an email to clojure+u...@googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.


  -- 
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clo...@googlegroups.com 
 javascript:
 Note that posts from new members are moderated - please be patient with 
 your first post.
 To unsubscribe from this group, send email to
 clojure+u...@googlegroups.com javascript:
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 --- 
 You received this message because you are subscribed to the Google Groups 
 Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send an 
 email to clojure+u...@googlegroups.com javascript:.
 For more options, visit https://groups.google.com/d/optout.




-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-10 Thread Ray Miller
On 10 March 2015 at 20:03, Sam Raker sam.ra...@gmail.com wrote:
 That's honestly closer to what I was originally envisioning--I've never
 really looked into graph dbs before, but I'll check out Neo4j tonight. Do
 you know whether you can model multiple edges between the same nodes?

Yes, certainly possible.

If you go for Neo4j you have two options for Clojure: embedded (with
the borneo library and reading Javadoc, or plain Java interop) or
stand-alone server with REST API (with the well-documented Neocons
library from Clojurewerkz). You'll also have to think about how to
model which text (song) each phrase came from - likely another node
type in the graph with a linking edge to the start of the phrase.

Great book on Neo4j available for free download, also covers data
modelling: http://neo4j.com/books/

 I'd love to be able to have POS-based wildcarding as a feature, so you could
 search for e.g. the ADJ goose, but that's a whole other layer of stuff, so
 it might go in the eventually, maybe pile.

Sounds like fun, but means doing some natural language processing on
the input texts, which is a much more difficult problem than simply
tokenizing.

Ray.

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-10 Thread Ray Miller
On 10 March 2015 at 17:58, Sam Raker sam.ra...@gmail.com wrote:
 I more meant deciding on a maximum size and storing them qua ngrams--it
 seems limiting. On the other hand, after a certain size, they stop being
 ngrams and start being something else--texts, possibly.

Exactly. When I first read your post, I almost suggested you model
this in a graph database like Neo4j or Titan. Each word would be a
node in the graph with an edge linking it to the next word in the
sentence. You could define an index on the words (so retrieving all
nodes for a given word would be fast), then follow edges to find and
count particular n-grams. This is more complicated than the relational
model I proposed, and will be a bit slower to query. But if you don't
want to put an upper-bound on the length of the n-gram when you index
the data, it might be the way to go.

Ray.

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-09 Thread John Wiseman
One thing you can do is index 1, 2, 3...n-grams and use a simple  fast
key-value store (like leveldb etc.)  e.g., you could have entries like

aunt rhodie - song-9, song-44
woman - song-12, song-65, song-96


That's basically how I made the Metafilter N-gram Viewer
http://mefingram.appspot.com/, a clone of Google Books Ngram Viewer
https://books.google.com/ngrams.

Another possibility is using Lucene.  Just be aware that Lucene calls
n-grams of characters (au, un, nt) n-grams but it calls n-grams of
words (that the, the old, old gray) shingles.  So you would end up
using (I think, I haven't done this) the ShingleFilter
https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
.

You might also find this article by Russ Cox interesting, where he
describes building and using an inverted trigram index:
http://swtch.com/~rsc/regexp/regexp4.html


John





Three things that you might find interesting:

Russ Cox' explanation of doing indexing and retrieval with an inverted
trigram index: http://swtch.com/~rsc/regexp/regexp4.html


On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks phill.w...@gmail.com wrote:

 A lot of guys would use Lucene.  Lucene calls n-grams of words shingles.
 [1]

 As for architecture, here is a suggestion to use Lucene to find keys to
 records in your real database. [2]

 [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/

 [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ


  --
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clojure@googlegroups.com
 Note that posts from new members are moderated - please be patient with
 your first post.
 To unsubscribe from this group, send email to
 clojure+unsubscr...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 ---
 You received this message because you are subscribed to the Google Groups
 Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to clojure+unsubscr...@googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.


-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-09 Thread Sam Raker
That's interesting. I've been really reluctant to hard code n-grams, but 
it's probably the best way to go.

On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:

 One thing you can do is index 1, 2, 3...n-grams and use a simple  fast 
 key-value store (like leveldb etc.)  e.g., you could have entries like

 aunt rhodie - song-9, song-44
 woman - song-12, song-65, song-96


 That's basically how I made the Metafilter N-gram Viewer 
 http://mefingram.appspot.com/, a clone of Google Books Ngram Viewer 
 https://books.google.com/ngrams.

 Another possibility is using Lucene.  Just be aware that Lucene calls 
 n-grams of characters (au, un, nt) n-grams but it calls n-grams of 
 words (that the, the old, old gray) shingles.  So you would end up 
 using (I think, I haven't done this) the ShingleFilter 
 https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
 .

 You might also find this article by Russ Cox interesting, where he 
 describes building and using an inverted trigram index: 
 http://swtch.com/~rsc/regexp/regexp4.html


 John





 Three things that you might find interesting:

 Russ Cox' explanation of doing indexing and retrieval with an inverted 
 trigram index: http://swtch.com/~rsc/regexp/regexp4.html


 On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks phill...@gmail.com 
 javascript: wrote:

 A lot of guys would use Lucene.  Lucene calls n-grams of words 
 shingles. [1]

 As for architecture, here is a suggestion to use Lucene to find keys to 
 records in your real database. [2]

 [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/

 [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ


  -- 
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clo...@googlegroups.com 
 javascript:
 Note that posts from new members are moderated - please be patient with 
 your first post.
 To unsubscribe from this group, send email to
 clojure+u...@googlegroups.com javascript:
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 --- 
 You received this message because you are subscribed to the Google Groups 
 Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send an 
 email to clojure+u...@googlegroups.com javascript:.
 For more options, visit https://groups.google.com/d/optout.




-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-07 Thread Ray Miller
On 7 March 2015 at 00:25, Sam Raker sam.ra...@gmail.com wrote:
 I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking
 individual songs into lines,  which are then split into words, so you end up
 with something like

 {0 {0 go 1 tell 2 aunt 3 rhodie} 1 {0 the 1 old 2 grey 3
 goose 4 is 5 dead}...}

Why split into lines? In this example, rhodie the is just as valid a
bigram as tell aunt. It would be more natural to split at a sentence
boundary.

 (Yes, maps with integer keys is kind of dumb; I thought about using vectors,
 but this is all going into MongoDB temporarily, and I'd rather just deal
 with maps instead of messing with Mongo's somewhat lacking array-handling
 stuff.)

 The idea, ultimately, is to build a front-end that would allow users to,
 e.g., search for all songs that contain the (sub)string aunt rhodie, or
 see how many times The Rolling Stones use the word woman vs how many times
 the Beatles do, etc. The inspiration comes largely from projects like
 COCA[2].

 I'm wondering if any of you have opinions about which database to use (Mongo
 is most likely just a stopgap), and how best to architect it. I'm most
 familiar with MySQL and Mongo, but I'd rather not be limited by just those
 two if there's a better option out there.

First up, I think you'll likely want to trade space for speed, and the
simplest way to do this is to store every n-gram you're interested in.
This means deciding up-front the maximum size of the n-gram you're
interested in. You could fairly easily model this in any relational
database as:

tracks

id (serial, primary key)
name (text)

n_grams

id (serial, primary key)
n (integer)
n_gram (text)

track_n_gram
---
track (references tracks.id)
n_gram (references n_grams.id)
num_occurrences (integer)

With that schema, you should be able to answer all of the queries you
posed with some simple SQL.

Of course, this being the Clojure mailing list you should also
consider Datomic, and I think the above could be mapped to a Datomic
schema without too much effort.

Ray.

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [OT?] Best DB/architecture for n-gram corpus?

2015-03-07 Thread Matching Socks
A lot of guys would use Lucene.  Lucene calls n-grams of words shingles. 
[1]

As for architecture, here is a suggestion to use Lucene to find keys to 
records in your real database. [2]

[1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/

[2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ


-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[OT?] Best DB/architecture for n-gram corpus?

2015-03-06 Thread Sam Raker
I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking 
individual songs into lines, which are then split into words, so you end up 
with something like

{0 {0 go 1 tell 2 aunt 3 rhodie} 1 {0 the 1 old 2 grey 3 
goose 4 is 5 dead}...}

(Yes, maps with integer keys is kind of dumb; I thought about using 
vectors, but this is all going into MongoDB temporarily, and I'd rather 
just deal with maps instead of messing with Mongo's somewhat lacking 
array-handling stuff.)

The idea, ultimately, is to build a front-end that would allow users to, 
e.g., search for all songs that contain the (sub)string aunt rhodie, or 
see how many times The Rolling Stones use the word woman vs how many 
times the Beatles do, etc. The inspiration comes largely from projects like 
COCA[2]. 

I'm wondering if any of you have opinions about which database to use 
(Mongo is most likely just a stopgap), and how best to architect it. I'm 
most familiar with MySQL and Mongo, but I'd rather not be limited by just 
those two if there's a better option out there. I'm thinking that I'll 
probably end up storing tokens over types--e.g., each word would be stored 
individually, as opposed to having an entry for, e.g., the that stores 
every instance of the word the. I was also thinking that I'll probably 
have to end up storing each token's previous and next, either as full 
references or just as strings. This seems potentially inefficient, however. 

(I could've just gone to StackOverflow with this, but figured I'm more 
likely to get a real answer here, because you all seem so smart and nice?)


Thanks!



[1] https://en.wikipedia.org/wiki/N-gram
[2] http://corpus.byu.edu/coca/

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.