The word/phrase count verbs that Raul and Mike proposed work great. However, I realized that I need a way to accumulate word/phrase counts over many sentences. What is a reasonable architecture for creating a set of word/phrase-count dictionaries, one dictionary each for one-word, two-word, three-word, and four-word phrases? An "addcount" verb would let me keep adding the word/phrase counts of each new sentence to the current dictionaries as I process it.
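One way the accumulation could be arranged, as a minimal sketch rather than a settled design: keep each dictionary in the same 2-row layout that pc already returns (boxed counts over boxed phrases), and merge a new table into it by summing the counts of identical phrases. The names addcount, addsentence, emptydict and dicts are hypothetical. Raul's pc is repeated only so the sketch runs on its own. J names cannot begin with a digit, so the four tables are held here in one boxed list indexed by phrase length minus one. Note that pc wraps around within each sentence, and the merge inherits that.

```j
NB. Sketch only: addcount, addsentence, emptydict and dicts are
NB. hypothetical names, not taken from the thread.

words=: ;:

NB. Raul's phrase count, repeated so the sketch is self-contained
pc=: 4 :0
 w=. x <@(;:inv)\ ($~ _1+x+#) words y
 n=. #/.~ w
 o=. \: n
 (<"0 o{n),:o{~.w
)

NB. x addcount y : merge two 2-row count tables (counts over phrases)
addcount=: 4 :0
 p=. ({: x) , {: y      NB. phrases of both tables
 c=. ; ({. x) , {. y    NB. their counts, unboxed, in the same order
 t=. p +//. c           NB. sum the counts of identical phrases
 o=. \: t               NB. descending order of total count
 (<"0 o{t) ,: o { ~. p
)

emptydict=: 2 0 $ a:    NB. a table with no phrases yet
dicts=: 4 # < emptydict NB. one table per phrase length 1..4

NB. x addsentence y : x is the boxed list of four tables, y is sentence text
addsentence=: 4 :0
 x addcount&.> 1 2 3 4 <@pc"0 _ y
)

NB. Usage: feed sentences one at a time
dicts=: dicts addsentence 'the cat in the hat ate a hat'
dicts=: dicts addsentence 'another cat saw the hat'
0 {:: dicts             NB. accumulated single-word table
2 {:: dicts             NB. accumulated three-word table
```

Because the merged table is re-sorted on every call, the dictionaries stay in descending count order after each sentence. For a large corpus it may be cheaper to sort only when the table is displayed.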
I envision four boxed nouns, each containing sorted lists of different-length phrases and their counts. The noun dictionaries would be labeled: 1word, 2word, 3word, and 4word.

Skip

Skip Cave
Cave Consulting LLC

On Thu, Jan 18, 2018 at 10:08 AM, Raul Miller <[email protected]> wrote:
> Ah, yes, I missed the bit about wrapping. I was in a hurry to get out
> the door and glossed over that part.
>
> Still, that's simple to add:
>
> words=: ;: NB. might change this because punctuation handling
>
> pc=: 4 :0
>  w=. x <@(;:inv)\ ($~ _1+x+#) words y
>  n=. #/.~ w
>  o=. \: n
>  (<"0 o{n),:o{~.w
> )
>
> Not quite the same implementation as your xwrap, but I think I prefer
> using reshape for something like this.
>
> Thanks,
>
> --
> Raul
>
>
> On Thu, Jan 18, 2018 at 9:56 AM, 'Mike Day' via Programming
> <[email protected]> wrote:
> > Raul's explicit verb is more readable, than the following, but I think
> > he's overlooked your requirement for word-wrapping.
> >
> > As I understand that little extra, one only needs to wrap 1 fewer words
> > than the group-size. I chose to wrap them at the end rather than at the
> > front, which your examples portrayed.
> >
> > I've assumed ;: is sufficient for the time being.
> >
> > xwrap =: ([ , ({.~ <:))~ NB. tack on (x-1) words at end of phrase
> >
> > xgroup=: [ <@:(;:^:_1)\ ] NB. form x-sized groups
> >
> > gwc =: ({"1~ \:@{.)@(~. ,:~ <@#/.~) NB. dec sort numbers and nub of groups
> >
> > wordcount =: gwc@([ xgroup [ xwrap ;:@]) NB. combine the verbs
> >
> >
> > 5{."1 ]3 wordcount b NB. 5{. to avoid email word-wrapping!?
> > +----------+----------+----------+-----------+---------+
> > |2         |1         |1         |1          |1        |
> > +----------+----------+----------+-----------+---------+
> > |in the hat|the cat in|cat in the|the hat ate|hat ate a|
> > +----------+----------+----------+-----------+---------+
> >
> > Might help a bit further,
> >
> > Mike
> >
> >
> > On 18/01/2018 08:23, Skip Cave wrote:
> >> I'm working on some Natural Language Processing algorithms.
> >>
> >> I built a basic set of word count verbs:
> >>
> >> NB. Test phrase:
> >> b =. 'the cat in the hat ate a hat and saw another cat in a hat in the hat'
> >>
> >> NB. Word count
> >> wc =.3 :'#/.~;:y'
> >>
> >> NB. Labeled word count
> >> lwc =.3 :'|:(;/#/.~;:y),.~.;:y'
> >>
> >> NB. Sorted & labeled word count
> >> slwc =.3 :' (\:wc y){"1 lwc y'
> >>
> >> slwc b
> >> ┌───┬───┬──┬───┬─┬───┬───┬───┬───────┐
> >> │4  │3  │3 │2  │2│1  │1  │1  │1      │
> >> ├───┼───┼──┼───┼─┼───┼───┼───┼───────┤
> >> │hat│the│in│cat│a│ate│and│saw│another│
> >> └───┴───┴──┴───┴─┴───┴───┴───┴───────┘
> >>
> >> Now I want to do the same thing for 2-word sequences (phrases) with a
> >> sliding window:
> >> |the cat|cat in|in the|the hat| .... etc.
> >> with wrap around the end:
> >> |the hat|hat the|the cat| .... etc.
> >>
> >> And 3-word sequences:
> >> |the cat in|cat in the|in the hat|.... etc.
> >> with wrap around the end:
> >> |in the hat|the hat the|hat the cat| ... etc
> >>
> >> And 4-word sequences, ... etc.
> >>
> >> Ideally, I would like a generalized phrase-count verb with the format:
> >>
> >> NB. Phrase count verb format:
> >> NB. x pc y
> >> NB. x= number of words in the phrase to be counted
> >> NB. y= the text to be processed
> >>
> >> The output layout should be the same for all n-sequence counts - a 2-row
> >> sorted list of the boxed counts, on top of the associated boxed word
> >> sequence.
> >>
> >> Skip
> >>
> >> Skip Cave
> >> Cave Consulting LLC
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
