The word/phrase count verbs Raul and Mike proposed work great. However, I
realized that I need a way to accumulate word/phrase counts over many
sentences. What is a reasonable architecture for maintaining a set of
word/phrase-count dictionaries, one each for one-, two-, three-, and
four-word phrases? An "addcount" verb would let me keep adding the
word/phrase counts of each new sentence to the current dictionaries as I
process it.

I envision four boxed nouns, one per phrase length, each containing a
sorted list of phrases of that length and their counts.
The noun dictionaries would be labeled: 1word, 2word, 3word, and 4word.
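
One possible shape for this, as a minimal sketch rather than a settled
design: keep each dictionary in the same 2-row layout that pc returns
(boxed counts over boxed phrases), and merge a new sentence's table into it
by summing counts per phrase with key. The names nodict, dicts, merge and
addcount are hypothetical. J names cannot begin with a digit, so instead of
nouns called 1word, 2word, and so on, the sketch assumes a single boxed
list with one table per phrase length. words and pc are copied from Raul's
message quoted below.

```j
words=: ;:                       NB. tokenizer, as in Raul's message

pc=: 4 : 0                       NB. Raul's phrase count, wrapping x-1 words
  w=. x <@(;:inv)\ ($~ _1+x+#) words y
  n=. #/.~ w
  o=. \: n
  (<"0 o{n),:o{~.w
)

nodict=: 2 0 $ a:                NB. assumption: an empty 2-row count table
dicts=: 4 # < nodict             NB. one boxed table per phrase length 1..4

merge=: 4 : 0                    NB. x, y: 2-row tables (counts over phrases)
  p=. ({: x) , {: y              NB. all phrases from both tables
  c=. (> {. x) , > {. y          NB. the matching counts, unboxed
  n=. p +//. c                   NB. sum the counts of each distinct phrase
  o=. \: n                       NB. order by descending total
  (<"0 o{n) ,: o { ~. p
)

addcount=: 4 : 0                 NB. x: boxed list of tables, y: sentence
  new=. (1 + i. # x) <@pc"0 _ y  NB. tables for phrase lengths 1..#x
  x merge&.> new                 NB. merge each new table into its dictionary
)
```

Usage would be dicts=: dicts addcount sentence for each sentence, with
> 0 { dicts giving the single-word table and > 3 { dicts the four-word one.
Because pc wraps each sentence separately, phrases never span two
sentences. An empty sentence is not handled in this sketch.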

Skip

Skip Cave
Cave Consulting LLC

On Thu, Jan 18, 2018 at 10:08 AM, Raul Miller <[email protected]> wrote:

> Ah, yes, I missed the bit about wrapping. I was in a hurry to get out
> the door and glossed over that part.
>
> Still, that's simple to add:
>
> words=: ;: NB. might change this to handle punctuation
>
> pc=: 4 :0
>   w=. x <@(;:inv)\ ($~ _1+x+#) words y
>   n=. #/.~ w
>   o=. \: n
>   (<"0 o{n),:o{~.w
> )
>
> Not quite the same implementation as your xwrap, but I think I prefer
> using reshape for something like this.
>
> Thanks,
>
> --
> Raul
>
>
> On Thu, Jan 18, 2018 at 9:56 AM, 'Mike Day' via Programming
> <[email protected]> wrote:
> > Raul's explicit verb is more readable than the following, but I think
> > he's overlooked your requirement for word-wrapping.
> >
> > As I understand that little extra, one only needs to wrap 1 fewer words
> > than the group-size. I chose to wrap them at the end rather than at the
> > front, which your examples portrayed.
> >
> > I've assumed ;: is sufficient for the time being.
> >
> > xwrap =: ([ , ({.~ <:))~    NB. tack on (x-1) words at end of phrase
> >
> > xgroup=: [ <@:(;:^:_1)\ ]   NB. form x-sized groups
> >
> > gwc   =: ({"1~ \:@{.)@(~. ,:~ <@#/.~) NB. dec sort numbers and nub of groups
> >
> > wordcount =: gwc@([ xgroup [ xwrap ;:@]) NB. combine the verbs
> >
> >
> >    5{."1 ]3 wordcount b  NB. 5{. to avoid email word-wrapping!?
> > +----------+----------+----------+-----------+---------+
> > |2         |1         |1         |1          |1        |
> > +----------+----------+----------+-----------+---------+
> > |in the hat|the cat in|cat in the|the hat ate|hat ate a|
> > +----------+----------+----------+-----------+---------+
> >
> > Might help a bit further,
> >
> >
> > Mike
> >
> >
> >
> >
> >
> > On 18/01/2018 08:23, Skip Cave wrote:
> >>
> >> I'm working on some Natural Language Processing algorithms.
> >>
> >> I built a basic set of word count verbs:
> >>
> >> NB. Test phrase:
> >>
> >>
> >>
> >> b =. 'the cat in the hat ate a hat and saw another cat in a hat in the hat'
> >>
> >>
> >>
> >> NB. Word count
> >>
> >>
> >>   wc =.3 :'#/.~;:y'
> >>
> >>
> >>
> >> NB. Labeled word count
> >>
> >>
> >>   lwc =.3 :'|:(;/#/.~;:y),.~.;:y'
> >>
> >>
> >> NB. Sorted & labeled word count
> >>
> >> slwc =.3 :' (\:wc y){"1 lwc y'
> >>
> >> slwc b
> >> ┌───┬───┬──┬───┬─┬───┬───┬───┬───────┐
> >> │4  │3  │3 │2  │2│1  │1  │1  │1      │
> >> ├───┼───┼──┼───┼─┼───┼───┼───┼───────┤
> >> │hat│the│in│cat│a│ate│and│saw│another│
> >> └───┴───┴──┴───┴─┴───┴───┴───┴───────┘
> >>
> >> Now I want to do the same thing for 2-word sequences (phrases) with a
> >> sliding window:
> >> |the cat|cat in|in the|the hat| .... etc.
> >> with wrap around the end:
> >> |the hat|hat the|the cat| .... etc.
> >>
> >> And 3-word sequences:
> >> |the cat in|cat in the|in the hat|.... etc.
> >> with wrap around the end:
> >> |in the hat|the hat the|hat the cat| ... etc
> >>
> >> And 4-word sequences, ... etc.
> >>
> >> Ideally, I would like a generalized phrase-count verb with the format:
> >>
> >> NB. Phrase count verb format:
> >> NB.  x pc y
> >> NB.  x= number of words in the phrase to be counted
> >> NB.  y= the text to be processed
> >>
> >> The output layout should be the same for all n-sequence counts - a 2-row
> >> sorted list of the boxed counts, on top of the associated boxed word
> >> sequence.
> >>
> >> Skip
> >>
> >> Skip Cave
> >> Cave Consulting LLC
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
