Hello Pascal.
I read your message with interest, but alas I understand very little of it.
Are your files ASCII text files? How are they read into J? How are they
converted into symbols? Do you make use of the fact that the dictionary file is
sorted? Is there a wiki where such topics are explained?
Thank you!
Bo.
On Saturday, 4 July 2015 at 4:45, 'Pascal Jasmin' via Programming
<[email protected]> wrote:
It does not appear that working with lists of CRCs is any faster than
working with words. With the dictionary (LF-separated) on the clipboard:
w =. 1 2 3 4 5 6 7 8 9 10 <@(] #~ [ = #@:] every)"0 _ (;: 'a i o'), cutLF wdclippaste ''
The above keeps only words of 10 letters or fewer and places them in 10 boxes,
one per word length. Next, build 10 boxes of sorted lists of CRCs (one for each
word size):
wn =. /:~@:(128!:3 every) each w
a is the same as in the previous (quoted) message:
> (]#~0.8<[:(+/%#)every(((wn{::~ :: 0: <:@#)e.~128!:3)every@:;:)each) a
This takes over 12 seconds, although it was faster than one big list of CRCs.
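For readers who want to try the idea on a small scale, here is a minimal sketch of the simpler single-list variant (not the per-length boxes used above). The tiny dictionary and the names dict, crcs and hits are hypothetical and only for illustration; 128!:3 is the same CRC foreign used in the line above.

```j
NB. Hypothetical tiny dictionary standing in for the real word list.
dict =: ;: 'a i o the cat sat on mat'

NB. Sorted list of CRC values, one per dictionary word.
crcs =: /:~ 128!:3 every dict

NB. For each word of a sentence: 1 if its CRC occurs in the list.
hits =: crcs e.~ 128!:3 every@:;:

hits 'the cat sat on a mat'
```

A word that is not in the dictionary can still be reported as a hit if its CRC happens to collide with a stored one, so this is a probabilistic test rather than an exact one.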
----- Original Message -----
From: 'Pascal Jasmin' via Programming <[email protected]>
To: Programming Forum <[email protected]>
Cc:
Sent: Friday, July 3, 2015 9:26 PM
Subject: [Jprogramming] symbol speed
A cool use of symbols, not obvious to me until today, is as a word dictionary
used to test whether some other input is in the dictionary or not.
Using this list on the clipboard:
https://gist.githubusercontent.com/Quackmatic/512736d51d84277594f2/raw/words
words =: s: (;: 'a i o'), cutLF wdclippaste '' NB.(adding 1 letter words)
Here is a list of gibberish sentences that contains 3 real sentences:
https://gist.githubusercontent.com/anonymous/c8fb349e9ae4fcb40cb5/raw/05a1ef03626057e1b57b5bbdddc4c2373ce4b465/challenge.txt
With that new list on the clipboard:
a =: (<', . ? ! : ; ') rplc~ each cutLF wdclippaste ''
The following is not terrible (3 seconds or so). It keeps the lines where more
than 80% of the words are in the dictionary.
> (] #~ 0.8 < [: (+/%#) every (words e.~ s:@:;: )each) a
Is there a way to make it faster?
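For reference, the symbol-based filter can be tried without the clipboard on a few lines of made-up data. This is only a sketch: the small dictionary, the sample lines, and the names frac, keep and lines are hypothetical stand-ins, and the structure mirrors the one-liner above.

```j
NB. Hypothetical tiny dictionary standing in for the 60k-word list.
words =: s: ;: 'a i o the cat sat on mat'

NB. Fraction of the words in a sentence that are in the dictionary.
frac =: [: (+/ % #) words e.~ s:@:;:

NB. Keep only the boxed lines whose fraction exceeds 0.8.
keep =: ] #~ 0.8 < frac every

lines =: 'the cat sat on a mat' ; 'qzx vbn the kjh' ; 'i sat on the mat'
> keep lines
```

Note that every string passed through s: is added to the session's symbol table, including the gibberish words, so memory use grows with the amount of text tested.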
I was thinking that a 2D sparse array, where the first index is the letter
count, the second index is the crc32, and the element value is 1 if the word
exists in the dictionary and 0 otherwise, could be OK.
Even simpler would be a list of valid crc32 values, since 60k/4B is going to
give a very low false positive rate, well suited to testing against an 80%
threshold.
Would those approaches be faster?
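As a rough check of the 60k/4B figure, the chance that a random non-word matches some stored crc32 is about the number of stored values divided by the size of the 32-bit space. This assumes the CRC values are spread uniformly and are all distinct; the names n and space are just for illustration.

```j
NB. Assumed round figures from the message: about 60,000 words, 32-bit CRCs.
n =: 60000
space =: 2 ^ 32

NB. Approximate false positive rate for a single non-word.
n % space
```

That works out to roughly one false positive per seventy thousand non-words tested, which is negligible next to an 80% threshold.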
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm