So, ok, let's say dictionary entries are sequences of characters,
regardless of how many words are in that sequence.

Let's also imagine that dictionaries are sufficiently small to fit
into memory. Even if this is not actually the case, this can still
serve as the model used to develop a more complex approach.

But counts should parallel the dictionaries rather than go "in" them.
This isn't so much a topological distinction as a structural
observation: if you require that the counts be "in" the dictionary,
you have to pay for that structure on every dictionary entry. If you
push the structure out of the dictionary, you pay for it only once.
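To make that concrete, here is a minimal sketch (the names and data are hypothetical): the phrases live in one boxed list, and the counts live in a plain numeric list that parallels it by index, so no per-entry structure is needed.

```j
DICT1=: ;: 'hat the cat'     NB. boxed phrase list (hypothetical data)
COUNT1=: 4 3 2               NB. parallel numeric counts, same order
COUNT1 {~ DICT1 i. <'the'    NB. look up the count for 'the' -> 3
```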

Also, it sounds like the only reason you care about sentences is so
that you can discard word sequences which cross sentence boundaries.
(Please verify or contradict this assumption.)

Still, that's probably enough for a first draft:

cleandicts=:3 :0
  DICT1=:DICT2=:DICT3=:DICT4=:''
  COUNT1=:COUNT2=:COUNT3=:COUNT4=: 0$0  NB. numeric empty, so new entries pad with 0
)

ref=:4 :0
  x,":y  NB. build a global name, e.g. 'DICT' ref 2 yields 'DICT2'
)

getref=: ".@ref  NB. fetch the value of the named global

setref=:1 :0
:
  (x ref m)=: y  NB. assign y to the named global: 'DICT' n setref value
)

entertranscripts=:3 :0
  1 2 3 4 entertransc&>/transcripts  NB. each phrase length against each boxed transcript
)

boxwords=: ;:  NB. just an example - might want something different

entertransc=:3 :0
  sentences=. (y e. '.?!') <@(x <@(;:inv)\ boxwords);._2 y
  new=. ~.;sentences   NB. unique x-word phrases, in order of first appearance
  n=. #/.~;sentences   NB. counts, in the same order
  d=. 'DICT' getref x
  c=. 'COUNT' getref x
  'DICT' x setref d=. d,new -. d
  c=. (#d){.c          NB. zero-pad counts for newly added phrases
  i=. d i. new
  'COUNT' x setref (n+i{c) i} c
  EMPTY
)
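A quick smoke test of the above, with hypothetical transcript data (the
COUNT globals are forced numeric here so the zero padding works):

```j
cleandicts ''
COUNT1=:COUNT2=:COUNT3=:COUNT4=: 0$0         NB. numeric empties
transcripts=: <'the cat sat. the cat ran.'   NB. one boxed transcript
entertranscripts ''
DICT2    NB. expect boxes for: the cat / cat sat / cat ran
COUNT2   NB. expect 2 1 1
```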

That said, note that ~. and -. rely on mechanisms which tend to give
poor performance for large (gigabyte) boxed lists. I do not remember
whether this has been fixed in recent J engine updates.

So even if this does what you want, this still should be thought of as
"just a model" until it's gone through some benchmarks and performance
estimates.

(And now I have to get back to what I was supposed to be working on...)

Thanks,

-- 
Raul


On Mon, Jan 22, 2018 at 2:24 PM, Skip Cave <[email protected]> wrote:
> Raul,
>
> Generally, much "cleaning" is done on the textual data before it is
> processed for phrase frequencies:
> 1. Convert all text to lower case
> 2. Stem or lemmatize <https://goo.gl/sDcpa2> the words
> 3. Glue two or more consecutive outputs from a user together, if they are a
> continuation
>
> We can assume that all this cleaning has been done.
>
> As you surmised, sentences are typically terminated with a period,
> question mark, or exclamation mark.
>
> Transcribed text will be boxed, one user input per box, with potentially
> multiple sentences in that box.
>
> I'm not interested in keeping track of all the places in the transcriptions
> where a specific phrase occurred.
> Each dictionary should simply contain a list of n-word phrases (n is
> different for each dictionary),
> and the count of the number of times that phrase occurred in the text.
>
> There should be a verb that initializes each dictionary.
> There should be a verb that processes text, calculates the phrase
> counts, and adds those counts (and possibly new phrases) to the appropriate
> dictionary.
>
> Skip
>
> Skip Cave
> Cave Consulting LLC
>
> On Mon, Jan 22, 2018 at 12:58 PM, Raul Miller <[email protected]> wrote:
>
>> Ok...
>>
>> With this in mind, though: what is a "sentence"?
>>
>> My guess is that you are working from transcripts - each transcript
>> corresponds to one conversation, and that a sentence is a part of a
>> transcript ending in a period ('.') or question mark ('?') or possibly
>> even an exclamation mark ('!').
>>
>> And I guess I would substitute "transcript" for "document" in my thinking.
>>
>> ... also, ... it sounds like transcript identity is important.
>>
>> But I'm still a bit confused about how important it is to keep
>> "sequence information".
>>
>> So I'll just propose a few data structures.
>>
>> [1] Word dictionary: words from transcripts are normalized (translate
>> upper case to lower case) and entered in the word dictionary if they
>> are not already there. The transcript can then be represented as a
>> sequence of word indices. Parallel to that would be a same length list
>> of sentence indices in the transcript.
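A tiny J sketch of that word-index representation (names and data here are
illustrative only):

```j
ws=: ;: tolower 'The cat saw the dog'
dict=: ~. ws    NB. word dictionary: the / cat / saw / dog
dict i. ws      NB. transcript as word indices -> 0 1 2 0 3
```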
>>
>> [2] Word pair dictionary: word index pairs from the previous
>> transcript representation are entered into the word pair dictionary if
>> they are not already there. Word pairs which cross sentence boundaries
>> are ignored / discarded. The transcript can then be represented as a
>> sequence of word pair indices. Parallel to that would be a same length
>> list of sentence indices from the transcript.
>>
>> Word groups of 3 (and 4) words follow the same system as word pairs,
>> each with their own dictionary.
>>
>> If you needed to go in the reverse direction ("find the transcripts
>> which had these words"), you'd need an additional data structure
>> tracking the transcripts that contained each word. There are several
>> elaborations on that which could be useful, depending on what your
>> bottlenecks are.
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Mon, Jan 22, 2018 at 1:42 PM, Skip Cave <[email protected]>
>> wrote:
>> > My answers to Raul's questions:
>> >
>> > 1. Raul:
>> >
>> > Are you looking for document statistics but do not care much about the
>> > specifics of any of the documents? (A spam detector, for example)
>> >
>> > Skip: The data is not from documents. It is transcribed text from
>> > conversations between a user and a support person or bot. I'm not very
>> > interested in the individual conversations (there are hundreds of
>> > thousands of conversations). If I need to, I can separate the user and
>> > support utterances before processing.
>> >
>> > 2. Raul:
>> >
>> > Do you instead care about being able to find documents, but plan on
>> > keeping the documents handy as well? (A search engine, for example)
>> > .
>> >
>> > Skip: I just want to get word/phrase frequencies across all the
>> > conversations
>> > as a total frequency spectrum for a specific domain. I don't need to
>> > preserve
>> > data on which conversation the phrases came from.
>> >
>> > 3. Raul:
>> >
>> > Or are these dictionaries intended to be your primary representation
>> > of your documents (which means you're discarding anything which these
>> > dictionaries do not represent)?
>> >
>> > Skip: No. These frequency stats dictionaries are to help train bots that
>> > can better
>> > understand human utterances in human/bot conversations.
>> >
>> > 4. Raul:
>> >
>> > Or are you not intending to store information for the long term nor
>> > are you working with a lot of documents, and you are instead in an
>> > exploratory phase where flexibility is your primary concern?
>> >
>> > Skip: The primary goal of the phrase frequency data is exploratory, to
>> > see if that data can be used to inform the design of algorithms that
>> > can better "understand" the user's intent in human/bot conversations in
>> > a specific domain. However, there could be terabytes of conversational
>> > data (transcribed text).
>> >
>> > Skip
>> >
>> >
>> > Skip Cave
>> > Cave Consulting LLC
>> >
>> > On Mon, Jan 22, 2018 at 12:09 PM, Raul Miller <[email protected]>
>> wrote:
>> >
>> >> I think you are asking about what intermediate data structure should
>> >> you be using when building these dictionaries.
>> >>
>> >> But I can think of several different ways of building things which
>> >> match your description, and picking the right one would depend on what
>> >> the consuming process would be doing.
>> >>
>> >> So, some questions:
>> >>
>> >>
>> >> Are you looking for document statistics but do not care much about the
>> >> specifics of any of the documents? (A spam detector, for example)
>> >>
>> >>
>> >> Do you instead care about being able to find documents, but plan on
>> >> keeping the documents handy as well? (A search engine, for example)
>> >>
>> >>
>> >> Or are these dictionaries intended to be your primary representation
>> >> of your documents (which means you're discarding anything which these
>> >> dictionaries do not represent)?
>> >>
>> >>
>> >> Or are you not intending to store information for the long term nor
>> >> are you working with a lot of documents, and you are instead in an
>> >> exploratory phase where flexibility is your primary concern?
>> >>
>> >> Or am I missing the point entirely and there's a better way of
>> >> describing where you are going with this?
>> >>
>> >> Thanks,
>> >>
>> >> --
>> >> Raul
>> >>
>> >>
>> >> On Mon, Jan 22, 2018 at 12:57 PM, Skip Cave <[email protected]>
>> >> wrote:
>> >> > The word/phrase count verbs Raul & Mike propose work great. However, I
>> >> > realized that I need a way to accumulate word/phrase counts over many
>> >> > sentences. What is a reasonable architecture for creating a set of
>> >> > word/phrase-count dictionaries, one dictionary each for single words,
>> >> > two words, three words, and four words? An "addcount" verb would
>> >> > allow me to keep adding the word/phrase counts of new sentences to
>> >> > the current dictionaries, as I process each sentence.
>> >> >
>> >> > I envision four boxed nouns, each containing sorted lists of
>> >> > different-length phrases and their counts.
>> >> > The noun dictionaries would be labeled: 1word, 2word, 3word, and
>> >> > 4word.
>> >> >
>> >> > Skip
>> >> >
>> >> > Skip Cave
>> >> > Cave Consulting LLC
>> >> >
>> >> > On Thu, Jan 18, 2018 at 10:08 AM, Raul Miller <[email protected]>
>> >> wrote:
>> >> >
>> >> >> Ah, yes, I missed the bit about wrapping. I was in a hurry to get out
>> >> >> the door and glossed over that part.
>> >> >>
>> >> >> Still, that's simple to add:
>> >> >>
>> >> >> words=: ;: NB. might change this because of punctuation handling
>> >> >>
>> >> >> pc=: 4 :0
>> >> >>   w=. x <@(;:inv)\ ($~ _1+x+#) words y
>> >> >>   n=. #/.~ w
>> >> >>   o=. \: n
>> >> >>   (<"0 o{n),:o{~.w
>> >> >> )
>> >> >>
>> >> >> Not quite the same implementation as your xwrap, but I think I prefer
>> >> >> using reshape for something like this.
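For what it's worth, the reshape idiom ($~ _1+x+#) cyclically extends the
boxed word list by x-1 items, which is what produces the wrap; for x=2 on
a three-word list:

```j
($~ _1+2+#) ;: 'a b c'   NB. reshape 3 boxes to 4: a b c a (wraps first word)
```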
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> --
>> >> >> Raul
>> >> >>
>> >> >>
>> >> >> On Thu, Jan 18, 2018 at 9:56 AM, 'Mike Day' via Programming
>> >> >> <[email protected]> wrote:
>> >> >> > Raul's explicit verb is more readable than the following, but I
>> >> >> > think he's overlooked your requirement for word-wrapping.
>> >> >> >
>> >> >> > As I understand that little extra, one only needs to wrap 1 fewer
>> >> >> > word than the group-size. I chose to wrap them at the end rather
>> >> >> > than at the front, which your examples portrayed.
>> >> >> >
>> >> >> > I've assumed ;: is sufficient for the time being.
>> >> >> >
>> >> >> > xwrap =: ([ , ({.~ <:))~    NB. tack on (x-1) words at end of phrase
>> >> >> >
>> >> >> > xgroup=: [ <@:(;:^:_1)\ ]   NB. form x-sized groups
>> >> >> >
>> >> >> > gwc   =: ({"1~ \:@{.)@(~. ,:~ <@#/.~) NB. dec sort numbers and nub of groups
>> >> >> >
>> >> >> > wordcount =: gwc@([ xgroup [ xwrap ;:@]) NB. combine the verbs
>> >> >> >
>> >> >> >
>> >> >> >    5{."1 ]3 wordcount b  NB. 5{. to avoid email word-wrapping!?
>> >> >> > +----------+----------+----------+-----------+---------+
>> >> >> > |2         |1         |1         |1          |1        |
>> >> >> > +----------+----------+----------+-----------+---------+
>> >> >> > |in the hat|the cat in|cat in the|the hat ate|hat ate a|
>> >> >> > +----------+----------+----------+-----------+---------+
>> >> >> >
>> >> >> > Might help a bit further,
>> >> >> >
>> >> >> >
>> >> >> > Mike
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On 18/01/2018 08:23, Skip Cave wrote:
>> >> >> >>
>> >> >> >> I'm working on some Natural Language Processing algorithms.
>> >> >> >>
>> >> >> >> I built a basic set of word count verbs:
>> >> >> >>
>> >> >> >> NB. Test phrase:
>> >> >> >>
>> >> >> >> b =. 'the cat in the hat ate a hat and saw another cat in a hat in the hat'
>> >> >> >>
>> >> >> >> NB. Word count
>> >> >> >>   wc =.3 :'#/.~;:y'
>> >> >> >>
>> >> >> >> NB. Labeled word count
>> >> >> >>   lwc =.3 :'|:(;/#/.~;:y),.~.;:y'
>> >> >> >>
>> >> >> >> NB. Sorted & labeled word count
>> >> >> >> slwc =.3 :' (\:wc y){"1 lwc y'
>> >> >> >>
>> >> >> >> slwc b
>> >> >> >> ┌───┬───┬──┬───┬─┬───┬───┬───┬───────┐
>> >> >> >> │4  │3  │3 │2  │2│1  │1  │1  │1      │
>> >> >> >> ├───┼───┼──┼───┼─┼───┼───┼───┼───────┤
>> >> >> >> │hat│the│in│cat│a│ate│and│saw│another│
>> >> >> >> └───┴───┴──┴───┴─┴───┴───┴───┴───────┘
>> >> >> >>
>> >> >> >> Now I want to do the same thing for 2-word sequences (phrases)
>> >> >> >> with a sliding window:
>> >> >> >> |the cat|cat in|in the|the hat| .... etc.
>> >> >> >> with wrap around the end:
>> >> >> >> |the hat|hat the|the cat| .... etc.
>> >> >> >>
>> >> >> >> And 3-word sequences:
>> >> >> >> |the cat in|cat in the|in the hat|.... etc.
>> >> >> >> with wrap around the end:
>> >> >> >> |in the hat|the hat the|hat the cat| ... etc
>> >> >> >>
>> >> >> >> And 4-word sequences, ... etc.
>> >> >> >>
>> >> >> >> Ideally, I would like a generalized phrase-count verb with the
>> >> >> >> format:
>> >> >> >>
>> >> >> >> NB. Phrase count verb format:
>> >> >> >> NB.  x pc y
>> >> >> >> NB.  x= number of words in the phrase to be counted
>> >> >> >> NB.  y= the text to be processed
>> >> >> >>
>> >> >> >> The output layout should be the same for all n-sequence counts -
>> >> >> >> a 2-row sorted list of the boxed counts, on top of the associated
>> >> >> >> boxed word sequence.
>> >> >> >>
>> >> >> >> Skip
>> >> >> >>
>> >> >> >> Skip Cave
>> >> >> >> Cave Consulting LLC
>> >> >> >> ----------------------------------------------------------------------
>> >> >> >> For information about J forums see http://www.jsoftware.com/forums.htm
>> >> >> >