I meant more the idea of deciding on a maximum size and storing them qua 
n-grams; it seems limiting. On the other hand, after a certain size, they stop 
being n-grams and start being something else: "texts," possibly.
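
For concreteness, here is a minimal sketch of the scheme quoted below, assuming a plain in-memory map stands in for the key-value store (leveldb or similar) and that tokens never contain spaces. The names `word-ngrams`, `encode`, `decode`, `index-song` and `max-n` are made up for illustration; `max-n` is the "maximum size" in question.

```clojure
(require '[clojure.string :as str])

(defn word-ngrams
  "All word n-grams of sizes 1..max-n from a seq of tokens."
  [max-n tokens]
  (for [n    (range 1 (inc max-n))
        gram (partition n 1 tokens)]
    gram))

;; The string key is just one encoding of the n-gram.
;; Assumption: tokens contain no spaces, so join/split round-trips.
(defn encode [gram] (str/join " " gram))
(defn decode [k] (str/split k #" "))

(defn index-song
  "Adds song-id under every n-gram (up to max-n words) found in text."
  [index max-n song-id text]
  (let [tokens (str/split (str/lower-case (str/trim text)) #"\s+")]
    (reduce (fn [idx gram]
              (update idx (encode gram) (fnil conj #{}) song-id))
            index
            (word-ngrams max-n tokens))))

(def index
  (-> {}
      (index-song 2 "song-9"  "Go tell Aunt Rhodie")
      (index-song 2 "song-44" "Aunt Rhodie the old gray goose")))

;; Lookup by the encoded key; both songs are listed under "aunt rhodie".
(get index "aunt rhodie")

;; Anything longer than max-n words was never stored, so this is nil.
(get index "go tell aunt")
```

The second lookup is the limitation: with `max-n` fixed at 2, a trigram query has nothing to hit unless it is decomposed into bigrams and the results intersected.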

On Tuesday, March 10, 2015 at 1:29:44 PM UTC-4, John Wiseman wrote:
>
> By "hard coding" n-grams, do you mean using the simple string 
> representation, e.g. "aunt rhodie" as the key in your database?  If so, 
> then maybe it helps to think of it from the perspective that it's not 
> really just text, it's a string that encodes an n-gram just like 
> "[\"aunt\", \"rhodie\"]" is another way to encode an n-gram--the 
> encoding/decoding uses clojure.string/join and clojure.string/split instead 
> of json/write and json/read, and escaping tokens that contain spaces is on 
> your TODO list at a low priority :)
>
> (And I think the Google n-gram corpus 
> <https://catalog.ldc.upenn.edu/LDC2006T13> uses the same format.)
>
>
> John
>
>
> On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker <sam....@gmail.com> wrote:
>
>> That's interesting. I've been really reluctant to "hard code" n-grams, 
>> but it's probably the best way to go.
>>
>> On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:
>>>
>>> One thing you can do is index 1, 2, 3...n-grams and use a simple & fast 
>>> key-value store (like leveldb etc.)  e.g., you could have entries like
>>>
>>> "aunt rhodie" -> song-9, song-44
>>> "woman" -> song-12, song-65, song-96
>>>
>>>
>>> That's basically how I made the Metafilter N-gram Viewer 
>>> <http://mefingram.appspot.com/>, a clone of Google Books Ngram Viewer 
>>> <https://books.google.com/ngrams>.
>>>
>>> Another possibility is using Lucene.  Just be aware that Lucene calls 
>>> n-grams of characters ("au", "un", "nt") n-grams but it calls n-grams of 
>>> words ("that the", "the old", "old gray") shingles.  So you would end up 
>>> using (I think, I haven't done this) the ShingleFilter 
>>> <https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>.
>>>
>>> You might also find this article by Russ Cox interesting, where he 
>>> describes building and using an inverted trigram index: 
>>> http://swtch.com/~rsc/regexp/regexp4.html
>>>
>>>
>>> John
>>>
>>> On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks <phill...@gmail.com> 
>>> wrote:
>>>
>>>> A lot of guys would use Lucene.  Lucene calls n-grams of words 
>>>> "shingles". [1]
>>>>
>>>> As for "architecture", here is a suggestion to use Lucene to find keys 
>>>> to records in your "real" database. [2]
>>>>
>>>> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
>>>>
>>>> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
>>>>
>>>>
>>>>  -- 
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Clojure" group.
>>>> To post to this group, send email to clo...@googlegroups.com
>>>> Note that posts from new members are moderated - please be patient with 
>>>> your first post.
>>>> To unsubscribe from this group, send email to
>>>> clojure+u...@googlegroups.com
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/clojure?hl=en
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "Clojure" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to clojure+u...@googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>
