You should try to engage Brian Becker at Dyalog APL on this topic.

Thanks,

-- 
Raul


On Fri, Jul 3, 2015 at 9:26 PM, 'Pascal Jasmin' via Programming
<[email protected]> wrote:
> A cool use of symbols, not obvious to me until today, is as a word dictionary 
> for testing whether some other input is in the dictionary.
>
> using this list on clipboard
>
> https://gist.githubusercontent.com/Quackmatic/512736d51d84277594f2/raw/words
>
>
>
> words =: s: (;: 'a i o'),  cutLF wdclippaste ''  NB.(adding 1 letter words)
>
> here is a list of gibberish sentences that contain 3 real sentences
>
> https://gist.githubusercontent.com/anonymous/c8fb349e9ae4fcb40cb5/raw/05a1ef03626057e1b57b5bbdddc4c2373ce4b465/challenge.txt
>
> with that new list on the clipboard
>
> a =: (<', . ? ! : ; ') rplc~ each cutLF wdclippaste ''
>
> The following is not terrible (3 seconds or so). It keeps the lines where 
> more than 80% of the words are in the dictionary.
>
>> (] #~ 0.8 < [: (+/%#) every (words e.~ s:@:;: )each)  a
>
> Is there a way to make it faster?
>
> I was thinking that a 2D sparse array could be OK: the first index is the 
> letter count, the second index is the crc32, and the element value is 1 if 
> the word exists in the dictionary and 0 otherwise.
>
> Even simpler would be a list of valid crc32 values, since 60k/4B is going to 
> give a very low false positive rate, which is well suited to testing against 
> an 80% threshold.
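
To put a rough number on that 60k/4B estimate: assuming about 60000 distinct
checksums spread uniformly over the 2^32 possible crc32 values (both figures
are assumptions taken from the estimate above, not measured from the word
list), the chance that a single non-word matches some dictionary checksum is
roughly 1.4 in 100,000.

```j
NB. Assumes roughly 60000 distinct checksums, uniform over 2^32 values.
60000 % 2^32        NB. chance that one non-word hits a dictionary CRC
```
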
>
> Would those approaches be faster?
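
A minimal sketch of the plain crc32-list idea, for timing against the symbol
version. It assumes the 128!:3 foreign (CRC-32 of a character list) is
available in your J version; crc, dict, crcs and frac are illustrative names,
and the short dict stands in for the real word list. Whether it is actually
faster would need to be measured.

```j
NB. Sketch only: crc, dict, crcs and frac are illustrative names.
crc  =: 128!:3                          NB. CRC-32 of a character list
dict =: ;: 'a i o the cat sat on mat'   NB. stand-in for the real word list
crcs =: ~. crc every dict               NB. unique checksums of the dictionary
NB. fraction of the words in y whose checksum appears in crcs
frac =: 3 : '(+/ % #) crcs e.~ crc every ;: y'
0.8 < frac 'the cat sat on a mat'       NB. 1 means the line would be kept
```

Note that ;: tokenizes by J's word rules, so lines containing quote
characters would need cleaning first, as the rplc step in the original does
for punctuation.
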
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm