Re: [Jprogramming] Word Count

Skip Cave Mon, 22 Jan 2018 10:43:21 -0800

My answers to Raul's questions:

1. Raul:

Are you looking for document statistics but do not care much about the
specifics of any of the documents? (A spam detector, for example)


Skip: The data is not from documents. It is transcribed text from
conversations between a user and a support person or bot. I'm not very
interested in the individual conversations (there are hundreds of thousands
of them). If I need to, I can separate the user and support utterances
before processing.

2. Raul:

Do you instead care about being able to find documents, but plan on
keeping the documents handy as well? (A search engine, for example)

Skip: I just want to get word/phrase frequencies across all the
conversations, as a total frequency spectrum for a specific domain. I don't
need to preserve data on which conversation the phrases came from.

3. Raul:

Or are these dictionaries intended to be your primary representation
of your documents (which means you're discarding anything which these
dictionaries do not represent)?

Skip: No. These frequency-stats dictionaries are meant to help train bots
that can better understand human utterances in human/bot conversations.

4. Raul:

Or are you not intending to store information for the long term nor
are you working with a lot of documents, and you are instead in an
exploratory phase where flexibility is your primary concern?

Skip: The primary goal of the phrase frequency data is exploratory, to see
if that data can be used to inform the design of algorithms that can better
"understand" the user's intent in human/bot conversations in a specific
domain. However, there could be terabytes of conversational data
(transcribed text).
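
A minimal sketch of how such running totals might be accumulated, reusing
the word-splitting and wrap-around idea from Raul's pc verb quoted below.
The names ngrams, addcount, sorted, and d2, and the choice of a
(phrases ; counts) pair as the dictionary, are assumptions made for
illustration only, not something settled in this thread:

```j
words=: ;:                  NB. tokenizer, as in Raul's pc (may need changing for punctuation)

NB. x ngrams y : boxed x-word phrases of text y, wrapping around the end
ngrams=: 4 : 0
  x <@(;:inv)\ ($~ _1+x+#) words y
)

NB. d addcount p : merge boxed phrase list p into dictionary d
NB. d is assumed to be a 2-box list: (boxed phrases) ; (integer counts)
addcount=: 4 : 0
  'k c'=. x
  k=. k , y                 NB. old phrases followed by the new ones
  c=. c , (#y) $ 1          NB. each new phrase occurrence counts once
  (~. k) ; k +//. c         NB. sum the counts of identical phrases
)

NB. sorted d : same 2-row layout as pc (boxed counts over phrases, descending)
sorted=: 3 : 0
  'k c'=. y
  o=. \: c
  (<"0 o{c) ,: o{k
)

d2=: (0$a:) ; 0$0           NB. hypothetical empty 2-word dictionary
d2=: d2 addcount 2 ngrams 'the cat in the hat'
d2=: d2 addcount 2 ngrams 'the hat ate a hat'
sorted d2
```

Rebuilding the nub on every call keeps the verb short, but the work per
sentence grows with the size of the dictionary. For terabytes of text it
would probably be better to count a large batch of sentences at once and
merge the batch totals; that is a guess, not a measured result.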

Skip

Skip Cave
Cave Consulting LLC

On Mon, Jan 22, 2018 at 12:09 PM, Raul Miller <[email protected]> wrote:

> I think you are asking about what intermediate data structure should
> you be using when building these dictionaries.
>
> But I can think of several different ways of building things which
> match your description, and picking the right one would depend on what
> the consuming process would be doing.
>
> So, some questions:
>
> 1. Are you looking for document statistics but do not care much about
> the specifics of any of the documents? (A spam detector, for example)
>
> 2. Do you instead care about being able to find documents, but plan on
> keeping the documents handy as well? (A search engine, for example)
>
> 3. Or are these dictionaries intended to be your primary representation
> of your documents (which means you're discarding anything which these
> dictionaries do not represent)?
>
> 4. Or are you not intending to store information for the long term nor
> are you working with a lot of documents, and you are instead in an
> exploratory phase where flexibility is your primary concern?
>
> Or am I missing the point entirely and there's a better way of
> describing where you are going with this?
>
> Thanks,
>
> --
> Raul
>
>
> On Mon, Jan 22, 2018 at 12:57 PM, Skip Cave <[email protected]> wrote:
> > The word/phrase count verbs Raul & Mike propose work great. However, I
> > realized that I need a way to accumulate word/phrase counts over many
> > sentences. What is a reasonable architecture for creating a set of
> > word/phrase-count dictionaries, one dictionary each for single words, two
> > words, three words, and four words? An "addcount" verb would allow me to
> > keep adding the word/phrase counts of new sentences to the current
> > dictionaries as I process each sentence.
> >
> > I envision four boxed nouns, each containing sorted lists of
> > different-length phrases and their counts.
> > The noun dictionaries would be labeled: 1word, 2word, 3word, and 4word.
> >
> > Skip
> >
> > Skip Cave
> > Cave Consulting LLC
> >
> > On Thu, Jan 18, 2018 at 10:08 AM, Raul Miller <[email protected]> wrote:
> >
> >> Ah, yes, I missed the bit about wrapping. I was in a hurry to get out
> >> the door and glossed over that part.
> >>
> >> Still, that's simple to add:
> >>
> >> words=: ;: NB. might change this because punctuation handling
> >>
> >> pc=: 4 :0
> >>   w=. x <@(;:inv)\ ($~ _1+x+#) words y
> >>   n=. #/.~ w
> >>   o=. \: n
> >>   (<"0 o{n),:o{~.w
> >> )
> >>
> >> Not quite the same implementation as your xwrap, but I think I prefer
> >> using reshape for something like this.
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >>
> >> On Thu, Jan 18, 2018 at 9:56 AM, 'Mike Day' via Programming
> >> <[email protected]> wrote:
> >> > Raul's explicit verb is more readable than the following, but I think
> >> > he's overlooked your requirement for word-wrapping.
> >> >
> >> > As I understand that little extra, one only needs to wrap one fewer
> >> > word than the group-size. I chose to wrap them at the end rather than
> >> > at the front, which your examples portrayed.
> >> >
> >> > I've assumed ;: is sufficient for the time being.
> >> >
> >> > xwrap =: ([ , ({.~ <:))~    NB. tack on (x-1) words at end of phrase
> >> >
> >> > xgroup=: [ <@:(;:^:_1)\ ]   NB. form x-sized groups
> >> >
> >> > gwc   =: ({"1~ \:@{.)@(~. ,:~ <@#/.~) NB. dec sort numbers and nub of groups
> >> >
> >> > wordcount =: gwc@([ xgroup [ xwrap ;:@]) NB. combine the verbs
> >> >
> >> >
> >> >    5{."1 ]3 wordcount b  NB. 5{. to avoid email word-wrapping!?
> >> > +----------+----------+----------+-----------+---------+
> >> > |2         |1         |1         |1          |1        |
> >> > +----------+----------+----------+-----------+---------+
> >> > |in the hat|the cat in|cat in the|the hat ate|hat ate a|
> >> > +----------+----------+----------+-----------+---------+
> >> >
> >> > Might help a bit further,
> >> >
> >> >
> >> > Mike
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On 18/01/2018 08:23, Skip Cave wrote:
> >> >>
> >> >> I'm working on some Natural Language Processing algorithms.
> >> >>
> >> >> I built a basic set of word count verbs:
> >> >>
> >> >>
> >> >>
> >> >> NB. Test phrase:
> >> >>
> >> >> b =. 'the cat in the hat ate a hat and saw another cat in a hat in the hat'
> >> >>
> >> >>
> >> >>
> >> >> NB. Word count
> >> >>
> >> >>
> >> >>   wc =.3 :'#/.~;:y'
> >> >>
> >> >>
> >> >>
> >> >> NB. Labeled word count
> >> >>
> >> >>
> >> >>   lwc =.3 :'|:(;/#/.~;:y),.~.;:y'
> >> >>
> >> >>
> >> >> NB. Sorted & labeled word count
> >> >>
> >> >> slwc =.3 :' (\:wc y){"1 lwc y'
> >> >>
> >> >> slwc b
> >> >> ┌───┬───┬──┬───┬─┬───┬───┬───┬───────┐
> >> >> │4  │3  │3 │2  │2│1  │1  │1  │1      │
> >> >> ├───┼───┼──┼───┼─┼───┼───┼───┼───────┤
> >> >> │hat│the│in│cat│a│ate│and│saw│another│
> >> >> └───┴───┴──┴───┴─┴───┴───┴───┴───────┘
> >> >>
> >> >> Now I want to do the same thing for 2-word sequences (phrases) with a
> >> >> sliding window:
> >> >> |the cat|cat in|in the|the hat| .... etc.
> >> >> with wrap around the end:
> >> >> |the hat|hat the|the cat| .... etc.
> >> >>
> >> >> And 3-word sequences:
> >> >> |the cat in|cat in the|in the hat|.... etc.
> >> >> with wrap around the end:
> >> >> |in the hat|the hat the|hat the cat| ... etc
> >> >>
> >> >> And 4-word sequences, ... etc.
> >> >>
> >> >> Ideally, I would like a generalized phrase-count verb with the
> >> >> format:
> >> >>
> >> >> NB. Phrase count verb format:
> >> >> NB.  x pc y
> >> >> NB.  x= number of words in the phrase to be counted
> >> >> NB.  y= the text to be processed
> >> >>
> >> >> The output layout should be the same for all n-sequence counts - a
> >> >> 2-row sorted list of the boxed counts, on top of the associated boxed
> >> >> word sequence.
> >> >>
> >> >> Skip
> >> >>
> >> >> Skip Cave
> >> >> Cave Consulting LLC
> >> >> ----------------------------------------------------------------------
> >> >> For information about J forums see http://www.jsoftware.com/forums.htm
> >> >
> >> >
> >> >
> >> >
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
