Raul's explicit verb is more readable than the following, but I think
he's overlooked your requirement for wrap-around.
As I understand that little extra, one only needs to wrap one fewer word
than the group size. I chose to wrap them at the end rather than at the
front, which your examples portrayed.
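A small sketch of the two wrapping choices, using a made-up four-word
phrase and hypothetical names (w, atend, atfront, g, h; none of these
come from your post). Either way there is one group per word and the
same groups turn up; only their order differs.

```j
   w       =: ;: 'a b c d'          NB. four boxed words (made-up sample)
   atend   =: w , 2 {. w            NB. group size 3: append the first 2 words
   atfront =: (_2 {. w) , w         NB. alternative: prepend the last 2 words
   g =: 3 <@(;:^:_1)\ atend         NB. a b c, b c d, c d a, d a b
   h =: 3 <@(;:^:_1)\ atfront       NB. c d a, d a b, a b c, b c d
   (/:~ g) -: /:~ h                 NB. same groups once sorted
1
```

So the counts should come out the same with either choice; only the
order of tied groups in the final table can change.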
I've assumed ;: is sufficient for the time being.
xwrap =: ([ , ({.~ <:))~ NB. tack on (x-1) words at end of phrase
xgroup=: [ <@:(;:^:_1)\ ] NB. form x-sized groups
gwc =: ({"1~ \:@{.)@(~. ,:~ <@#/.~) NB. counts over nub of groups, sorted descending
wordcount =: gwc@([ xgroup [ xwrap ;:@]) NB. combine the verbs
5{."1 ]3 wordcount b NB. 5{. to avoid email word-wrapping!?
+----------+----------+----------+-----------+---------+
|2         |1         |1         |1          |1        |
+----------+----------+----------+-----------+---------+
|in the hat|the cat in|cat in the|the hat ate|hat ate a|
+----------+----------+----------+-----------+---------+
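For comparison, here is the same pipeline spelled out as an explicit
verb, one step per line. This is only a sketch: the name pcx and its
local names are made up here, and it hard-codes ;: for splitting the
text into words, just as above.

```j
pcx =: 4 : 0
  w =. ;: y                  NB. box the words of the text
  w =. w , (x-1) {. w        NB. wrap: append the first x-1 words
  g =. x <@(;:^:_1)\ w       NB. every run of x words, rejoined and boxed
  c =. #/.~ g                NB. count of each distinct group
  o =. \: c                  NB. descending order of the counts
  (o { <"0 c) ,: o { ~. g    NB. boxed counts over the matching groups
)
```

With the tacit verbs and b loaded, (3 pcx b) -: 3 wordcount b should
return 1.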
Might help a bit further,
Mike
On 18/01/2018 08:23, Skip Cave wrote:
I'm working on some Natural Language Processing algorithms. I built a
basic set of word count verbs:
NB. Test phrase
b =. 'the cat in the hat ate a hat and saw another cat in a hat in the hat'
NB. Word count
wc =.3 :'#/.~;:y'
NB. Labeled word count
lwc =.3 :'|:(;/#/.~;:y),.~.;:y'
NB. Sorted & labeled word count
slwc =.3 :' (\:wc y){"1 lwc y'
slwc b
┌───┬───┬──┬───┬─┬───┬───┬───┬───────┐
│4  │3  │3 │2  │2│1  │1  │1  │1      │
├───┼───┼──┼───┼─┼───┼───┼───┼───────┤
│hat│the│in│cat│a│ate│and│saw│another│
└───┴───┴──┴───┴─┴───┴───┴───┴───────┘
Now I want to do the same thing for 2-word sequences (phrases) with a
sliding window:
|the cat|cat in|in the|the hat| .... etc.
with wrap around the end:
|the hat|hat the|the cat| .... etc.
And 3-word sequences:
|the cat in|cat in the|in the hat|.... etc.
with wrap around the end:
|in the hat|the hat the|hat the cat| ... etc
And 4-word sequences, ... etc.
Ideally, I would like a generalized phrase-count verb with the format:
NB. Phrase count verb format:
NB. x pc y
NB. x= number of words in the phrase to be counted
NB. y= the text to be processed
The output layout should be the same for all n-sequence counts - a 2-row
sorted list of the boxed counts, on top of the associated boxed word
sequence.
Skip
Skip Cave
Cave Consulting LLC
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm