What I meant was more the business of deciding on a maximum size and storing them qua n-grams--that seems limiting. On the other hand, past a certain size they stop being n-grams and start being something else--"texts," possibly.
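For what it's worth, here is a minimal sketch of what I understand the suggestion to be, assuming an in-memory map in place of a real key-value store like leveldb. The names ngrams, encode, decode, and index-song and the max-n cutoff are all made up for illustration, and tokens that contain spaces aren't escaped:

```clojure
(require '[clojure.string :as str])

(defn ngrams
  "All word n-grams of sizes 1..max-n over a seq of tokens."
  [max-n tokens]
  (for [n    (range 1 (inc max-n))
        gram (partition n 1 tokens)]
    gram))

;; Encode an n-gram as a space-joined string and back again.
;; Assumption: no token contains a space.
(defn encode [gram] (str/join " " gram))
(defn decode [s] (str/split s #" "))

(defn index-song
  "Add every 1..max-n-gram of text to index (a map from encoded
  n-gram to a set of song ids)."
  [index max-n song-id text]
  (reduce (fn [idx gram]
            (update-in idx [(encode gram)] (fnil conj #{}) song-id))
          index
          (ngrams max-n (re-seq #"\S+" (str/lower-case text)))))

(def idx
  (-> {}
      (index-song 2 "song-9" "Go tell Aunt Rhodie")
      (index-song 2 "song-44" "Aunt Rhodie the old gray goose")))

(get idx "aunt rhodie")   ; => the set of both song ids
(get idx "go tell aunt")  ; => nil, since it is longer than max-n
```

With max-n set to 2, "aunt rhodie" is a key but "go tell aunt" isn't, which is the limit I was talking about: anything longer than the cutoff has to be found some other way.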
On Tuesday, March 10, 2015 at 1:29:44 PM UTC-4, John Wiseman wrote:
>
> By "hard coding" n-grams, do you mean using the simple string
> representation, e.g. "aunt rhodie" as the key in your database? If so,
> then maybe it helps to think of it from the perspective that it's not
> really just text, it's a string that encodes an n-gram just like
> "[\"aunt\", \"rhodie\"]" is another way to encode an n-gram--the
> encoding/decoding uses clojure.string/join and clojure.string/split instead
> of json/write and json/read, and escaping tokens that contain spaces is on
> your TODO list at a low priority :)
>
> (And I think the Google n-gram corpus
> <https://catalog.ldc.upenn.edu/LDC2006T13> uses the same format.)
>
>
> John
>
>
> On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker <sam....@gmail.com>
> wrote:
>
>> That's interesting. I've been really reluctant to "hard code" n-grams,
>> but it's probably the best way to go.
>>
>> On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:
>>>
>>> One thing you can do is index 1, 2, 3...n-grams and use a simple & fast
>>> key-value store (like leveldb etc.) e.g., you could have entries like
>>>
>>> "aunt rhodie" -> song-9, song-44
>>> "woman" -> song-12, song-65, song-96
>>>
>>>
>>> That's basically how I made the Metafilter N-gram Viewer
>>> <http://mefingram.appspot.com/>, a clone of Google Books Ngram Viewer
>>> <https://books.google.com/ngrams>.
>>>
>>> Another possibility is using Lucene. Just be aware that Lucene calls
>>> n-grams of characters ("au", "un", "nt") n-grams but it calls n-grams of
>>> words ("that the", "the old", "old gray") shingles. So you would end up
>>> using (I think, I haven't done this) the ShingleFilter
>>> <https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>
>>> .
>>>
>>> You might also find this article by Russ Cox interesting, where he
>>> describes building and using an inverted trigram index:
>>> http://swtch.com/~rsc/regexp/regexp4.html
>>>
>>>
>>> John
>>>
>>>
>>> Three things that you might find interesting:
>>>
>>> Russ Cox' explanation of doing indexing and retrieval with an inverted
>>> trigram index: http://swtch.com/~rsc/regexp/regexp4.html
>>>
>>>
>>> On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks <phill...@gmail.com>
>>> wrote:
>>>
>>>> A lot of guys would use Lucene. Lucene calls n-grams of words
>>>> "shingles". [1]
>>>>
>>>> As for "architecture", here is a suggestion to use Lucene to find keys
>>>> to records in your "real" database. [2]
>>>>
>>>> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
>>>>
>>>> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
>>>>
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Clojure" group.
>>>> To post to this group, send email to clo...@googlegroups.com
>>>> Note that posts from new members are moderated - please be patient with
>>>> your first post.
>>>> To unsubscribe from this group, send email to
>>>> clojure+u...@googlegroups.com
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/clojure?hl=en
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Clojure" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to clojure+u...@googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>