Hello Pascal. 
I read your message with interest, but alas I understand very little of it. 
Are your files ascii text files? How are they read into J? How are they 
transferred into symbols? Do you utilize the fact that the dictionary file is 
sorted? Is there a wiki where such topics are explained?
Thank you!
Bo.
 


On Saturday, 4 July 2015 at 4:45, 'Pascal Jasmin' via Programming 
<[email protected]> wrote:
   
 

It does not appear that working with lists of CRCs instead of words is any faster.

With the LF-delimited dictionary data on the clipboard:

 w =. 1 2 3 4 5 6 7 8 9 10 <@(] #~ [ = #@:] every)"0 _ (;: 'a i o'),  cutLF wdclippaste ''

The above keeps only words of 10 letters or fewer and places them in 10 boxes, one per word length.

Then 10 boxes of sorted lists of CRCs (one for each word length):

wn =. /:~@:(128!:3 every) each w

a is the same as in the previous message, quoted below.

>  (]#~0.8<[:(+/%#)every(((wn{::~ :: 0: <:@#)e.~128!:3)every@:;:)each) a

This takes over 12 seconds, which was still faster than using one big list of CRCs.
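
For readers following along without the clipboard data, here is a minimal sketch of the same per-length CRC lookup on a toy dictionary. The names crc, toy, len and inDict and the three-length limit are assumptions made for illustration only; they are not part of the session above, and the bucketing is written with an equality table rather than the "0 _ phrase used above.

```j
NB. Toy data: a hypothetical mini dictionary, not the gist word list
crc  =: 128!:3                        NB. CRC32 of a character list
toy  =: ;: 'a i o on the cat sat mat'
len  =: #&> toy                       NB. length of each word
w    =: (1 2 3 =/ len) <@# toy        NB. one box of words per length 1, 2, 3
wn   =: /:~@:(crc every) each w       NB. per length: sorted list of CRCs

NB. 1 if the CRC of word y is in the bucket for its length, else 0;
NB. a length with no bucket makes {:: fail, and :: 0: then supplies 0
inDict =: (wn {::~ :: 0: <:@#) e.~ crc

inDict every ;: 'cat dog zzzzz'
```

On this toy data the last line should report cat as present and both dog and zzzzz as absent. Note that sorting the CRC lists does not by itself change how e. searches them.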


----- Original Message -----
From: 'Pascal Jasmin' via Programming <[email protected]>
To: Programming Forum <[email protected]>
Cc: 
Sent: Friday, July 3, 2015 9:26 PM
Subject: [Jprogramming] symbol speed

A cool use of symbols, not obvious to me until today, is a word dictionary used 
to test whether some other input is in the dictionary.

Using this list on the clipboard:

https://gist.githubusercontent.com/Quackmatic/512736d51d84277594f2/raw/words



words =: s: (;: 'a i o'),  cutLF wdclippaste ''  NB.(adding 1 letter words)

Here is a list of gibberish lines that contains 3 real sentences:

https://gist.githubusercontent.com/anonymous/c8fb349e9ae4fcb40cb5/raw/05a1ef03626057e1b57b5bbdddc4c2373ce4b465/challenge.txt

with that new list on the clipboard

a =: (<', . ? ! : ; ') rplc~ each cutLF wdclippaste ''

The following is not terrible (3 seconds or so). It keeps the lines where more 
than 80% of the words are in the dictionary.

> (] #~ 0.8 < [: (+/%#) every (words e.~ s:@:;: )each)  a
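
The same filter can be tried without the gist files. The sketch below uses a made-up dictionary and two made-up lines; dict, lines, hit and keep are hypothetical names introduced only for this illustration.

```j
NB. Toy dictionary and input lines (hypothetical data, not the gist files)
dict  =: s: ;: 'a i o the cat sat on mat'
lines =: 'the cat sat on a mat' ; 'xq zzk the qqv bmw'

NB. fraction of a line's words that are found in the dictionary
hit  =: (+/ % #)@:(dict e.~ s:@:;:)

NB. keep the lines where more than 80% of the words are known
keep =: ] #~ 0.8 < hit every

> keep lines
```

Here hit gives 1 for the first line and 0.2 for the second, so only the first line survives the filter.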

Is there a way to make it faster?

I was thinking that a 2D sparse array, where the first index is the letter count, 
the second index is the CRC32, and the element value is 1 if the word exists in 
the dictionary and 0 otherwise, could be OK.

Even simpler would be a list of valid CRC32 values: with about 60k entries out 
of about 4 billion possible values, the false-positive rate is very low, which 
is well suited to testing against an 80% threshold.
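
As a rough check of that rate, assuming about 60000 dictionary words and CRC32 values spread evenly over all 2^32 possibilities (both are simplifying assumptions; fp is a name chosen just for this sketch):

```j
NB. chance that the CRC32 of a random non-word matches one of ~60000 dictionary CRCs
fp =: 60000 % 2^32
fp
```

That comes to roughly 1.4e_5, about one false hit per 70000 non-words tested.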

Would those approaches be faster?
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm