words=: ;: NB. tokenizer; may need replacing if punctuation should be handled differently

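pc=: 4 :0
  w=. x <@(;:inv)\ words y   NB. x-word sliding windows, each joined into one boxed phrase
  n=. #/.~ w                 NB. occurrences of each distinct phrase
  o=. \: n                   NB. permutation that sorts the counts descending
  (<"0 o{n),:o{~.w           NB. row 0: boxed counts, row 1: the matching phrases
)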
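
Note that pc as written does not wrap around the end of the text, which
the request below asks for. Here is a minimal sketch of one way to add
that, assuming it is enough to append the first x-1 words to the end of
the word list before taking the windows. The name pcw is hypothetical
(it is not from the original request), and it reuses words from above:

```j
pcw=: 4 :0
  t=. words y
  t=. t , (x-1) {. t     NB. assumption: wrap by repeating the first x-1 words
  w=. x <@(;:inv)\ t     NB. same windowing as pc, now over the extended list
  n=. #/.~ w
  o=. \: n
  (<"0 o{n),:o{~.w
)
```

With the 18-word test phrase b from the quoted message, 2 pcw b should
count 18 two-word windows, including the wrapped phrase 'hat the'. The
sketch assumes x does not exceed the number of words in y.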

I hope this helps,

-- 
Raul


On Thu, Jan 18, 2018 at 3:23 AM, Skip Cave <[email protected]> wrote:
> I'm working on some Natural Language Processing algorithms.
>
> I built a basic set of word count verbs:
>
>
>
> NB. Test phrase:
>
>
>
> b =. 'the cat in the hat ate a hat and saw another cat in a hat in the hat'
>
>
>
> NB. Word count
>
>
>  wc =.3 :'#/.~;:y'
>
>
>
> NB. Labeled word count
>
>
>  lwc =.3 :'|:(;/#/.~;:y),.~.;:y'
>
>
> NB. Sorted & labeled word count
>
> slwc =.3 :' (\:wc y){"1 lwc y'
>
> slwc b
> ┌───┬───┬──┬───┬─┬───┬───┬───┬───────┐
> │4  │3  │3 │2  │2│1  │1  │1  │1      │
> ├───┼───┼──┼───┼─┼───┼───┼───┼───────┤
> │hat│the│in│cat│a│ate│and│saw│another│
> └───┴───┴──┴───┴─┴───┴───┴───┴───────┘
>
> Now I want to do the same thing for 2-word sequences (phrases) with a
> sliding window:
> |the cat|cat in|in the|the hat| .... etc.
> with wrap around the end:
> |the hat|hat the|the cat| .... etc.
>
> And 3-word sequences:
> |the cat in|cat in the|in the hat|.... etc.
> with wrap around the end:
> |in the hat|the hat the|hat the cat| ... etc
>
> And 4-word sequences, ... etc.
>
> Ideally, I would like a generalized phrase-count verb with the format:
>
> NB. Phrase count verb format:
> NB.  x pc y
> NB.  x= number of words in the phrase to be counted
> NB.  y= the text to be processed
>
> The output layout should be the same for all n-sequence counts - a 2-row
> sorted list of the boxed counts, on top of the associated boxed word
> sequence.
>
> Skip
>
> Skip Cave
> Cave Consulting LLC
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
