Sorry, I wasn’t considering trigrams in my off-the-cuff stuff.

Mike

Sent from my iPad

> On 12 Oct 2019, at 18:38, 'Jim Russell' via Programming 
> <programm...@jsoftware.com> wrote:
> 
> Thanks. (Except for the part about  etaionshrdlu... and the fact that I was 
> unaware of any difference between prepend and prefix.) I had considered texts 
> using different character sets, and figured I would be comfortable reporting 
> them as completely different. 
> 
> Thanks for the insights!
> 
> (I should credit Skip Cave for first mentioning trigrams in this form long 
> ago.)
> 
>> On Oct 12, 2019, at 12:59 PM, 'Mike Day' via Programming 
>> <programm...@jsoftware.com> wrote:
>> 
>> The advantage is that you’re in control of the domain of interest.  You 
>> know all cases that _might_ arise. 
>> 
>> With my example of a given alphabet, we might be looking at the frequency of 
>> letters in English prose,  and find their sort order is along the lines of 
>> etaionshrdlu... (iirc).
>> 
>> Not so good if you come across kanji, say, assuming it’s something that’s 
>> foreign to you, and you don’t know the alphabet/symbol set.  Then you can 
>> only work on the symbols, cases, that you encounter.
>> 
>> Even then, though, you can compare two sets by using the nub of their union 
>> as the basis for frequency analysis.  In that case,  I would prepend (not 
>> prefix, pardon my slip earlier) that nub to each series,  and, once again, 
>> decrement all counts for each series by one.
>> 
>> Any clearer?
>> 
>> Cheers,
>> 
>> Mike
>> 
>> 
>> 
>> Sent from my iPad
>> 
>>> On 12 Oct 2019, at 16:26, 'Jim Russell' via Programming 
>>> <programm...@jsoftware.com> wrote:
>>> 
>>> Thanks, interesting. What advantage does that give you?
>>> 
>>>>> On Oct 12, 2019, at 9:02 AM, 'Mike Day' via Programming 
>>>>> <programm...@jsoftware.com> wrote:
>>>> 
>>>> A quick thought,  might not be what you have in mind.
>>>> 
>>>> If, say, you’re seeking the frequency of letters,  it’s worth prefixing 
>>>> the sorted alphabet of interest to your string and then subtracting one 
>>>> from the scores.
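>>>> 
>>>> As a minimal sketch of this in J, assuming a lowercase Latin alphabet and 
>>>> a made-up sample string (alphabet, text, kept and counts are illustrative 
>>>> names, not from this thread):
>>>> 
>>>> ```j
>>>> NB. Hypothetical data: the alphabet of interest and a sample text.
>>>> alphabet=: 'abcdefghijklmnopqrstuvwxyz'
>>>> text=: 'the cat sat on the mat'
>>>> NB. Keep only characters that are in the alphabet, so that Key's
>>>> NB. groups line up one-to-one with the alphabet.
>>>> kept=: text #~ text e. alphabet
>>>> NB. Prefix the alphabet, count each group with Key, then subtract
>>>> NB. one for the extra copy of every letter.
>>>> counts=: <: #/.~ alphabet , kept
>>>> NB. counts has 26 items in alphabet order, 0 for letters never seen.
>>>> ```
>>>> 
>>>> Because the alphabet comes first, the counts always appear in the same 
>>>> order, whatever the text contains.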
>>>> 
>>>> Useful for me sometimes, anyway.
>>>> 
>>>> Mike
>>>> 
>>>> Sent from my iPad
>>>> 
>>>>> On 12 Oct 2019, at 06:50, 'Jim Russell' via Programming 
>>>>> <programm...@jsoftware.com> wrote:
>>>>> 
>>>>> Looks promising. Typically, the strings are different lengths, and we may 
>>>>> not have access to them at the same time. (Which is why I had the 
>>>>> intermediate summary step.) Let me ponder that (I don't think it will 
>>>>> matter) while I study your approach more. Thanks very much!
>>>>> 
>>>>>>> On Oct 12, 2019, at 1:22 AM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>> 
>>>>>> Here's one approach...
>>>>>> 
>>>>>> I find it much easier to work with if there is actual data. The following
>>>>>> may not be representative of your data but it gives us somewhere to 
>>>>>> start.
>>>>>> 
>>>>>> ]'X Y'=: 'actg' {~ 2 30 ?@$ 4
>>>>>> 
>>>>>> ggtaaaatgactgtagtgaagaaggagtcc
>>>>>> ctgattaaggttcggtgtcgataccgcgca
>>>>>> 
>>>>>> 
>>>>>> We now have 2 strings X and Y. Let's obtain the trigrams for each string
>>>>>> 
>>>>>> trig=: 3,\&.> X;Y
>>>>>> 
>>>>>> Get the nub of the union of both sets of trigrams and prepend it to each
>>>>>> trigram set.
>>>>>> 
>>>>>> supertrig=: (,~&.> <@~.@;) trig
>>>>>> 
>>>>>> Now we can use Key to count the trigrams in each set and decrement by 1
>>>>>> (for the extra copy that we added).
>>>>>> 
>>>>>> <: #/.~&> supertrig
>>>>>> 
>>>>>> 1 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>>>> 2 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1
>>>>>> 
>>>>>> Or to summarise by trigram:
>>>>>> 
>>>>>> (~.@; trig);|: <: #/.~&> supertrig
>>>>>> 
>>>>>> +---+---+
>>>>>> |ggt|1 2|
>>>>>> |gta|2 0|
>>>>>> |taa|1 1|
>>>>>> |aaa|2 0|
>>>>>> |aat|1 0|
>>>>>> |atg|1 0|
>>>>>> |tga|2 1|
>>>>>> |gac|1 0|
>>>>>> |act|1 0|
>>>>>> |ctg|1 1|
>>>>>> |tgt|1 1|
>>>>>> |tag|1 0|
>>>>>> |agt|2 0|
>>>>>> |gtg|1 1|
>>>>>> |gaa|2 0|
>>>>>> |aag|2 1|
>>>>>> |aga|1 0|
>>>>>> |agg|1 1|
>>>>>> |gga|1 0|
>>>>>> |gag|1 0|
>>>>>> |gtc|1 1|
>>>>>> |tcc|1 0|
>>>>>> |gat|0 2|
>>>>>> |att|0 1|
>>>>>> |tta|0 1|
>>>>>> |gtt|0 1|
>>>>>> |ttc|0 1|
>>>>>> |tcg|0 2|
>>>>>> |cgg|0 1|
>>>>>> |cga|0 1|
>>>>>> |ata|0 1|
>>>>>> |tac|0 1|
>>>>>> |acc|0 1|
>>>>>> |ccg|0 1|
>>>>>> |cgc|0 2|
>>>>>> |gcg|0 1|
>>>>>> |gca|0 1|
>>>>>> +---+---+
>>>>>> 
>>>>>> 
>>>>>>> On Sat, Oct 12, 2019 at 4:40 PM 'Jim Russell' via Programming <
>>>>>>> programm...@jsoftware.com> wrote:
>>>>>>> 
>>>>>>> Sure, thanks. I'm working to re-implement a text comparison program I 
>>>>>>> did
>>>>>>> using VBA & Microsoft Access a number of years back.
>>>>>>> 
>>>>>>> The object is to compare two text documents and see how similar one is 
>>>>>>> to
>>>>>>> the other by  comparing the number of unique trigrams that are found in
>>>>>>> each.
>>>>>>> For each text string a table of trigrams is constructed with the
>>>>>>> expression 3,\x. The resulting table of 3-character samples m is then
>>>>>>> tallied using #/.~m . This yields a vector of counts of each unique 
>>>>>>> trigram
>>>>>>> corresponding to (an unseen) nub of m. The count, and a copy of the nub 
>>>>>>> of
>>>>>>> m, represent a summary of the text in string x.
>>>>>>> This same process is then repeated to create a summary for the second
>>>>>>> string, y.
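>>>>>>> 
>>>>>>> A minimal sketch of that summary step, assuming the summary is kept as
>>>>>>> a boxed pair of nub and counts (smry, nub and cnt are hypothetical
>>>>>>> names, and the sample string is made up):
>>>>>>> 
>>>>>>> ```j
>>>>>>> NB. smry: build the trigram table with 3 ,\ y, then return a boxed
>>>>>>> NB. pair: (nub of the trigram rows) ; (count of each unique row).
>>>>>>> smry=: (~. ; #/.~)@(3&(,\))
>>>>>>> NB. Unpack the summary of a made-up sample string.
>>>>>>> 'nub cnt'=: smry 'ggtaaaatgactgtagtgaa'
>>>>>>> NB. nub is a 3-column character table; cnt has one count per row of nub.
>>>>>>> ```
>>>>>>> 
>>>>>>> Keeping the nub alongside the counts means each string can be
>>>>>>> summarised on its own, without the other string being available.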
>>>>>>> 
>>>>>>> The next step in the process is to assign a score of 0 to 1 based on a
>>>>>>> comparison of the two string summaries. It would seem sensible to 
>>>>>>> compare
>>>>>>> the nub of the two text strings to each other. What is the difference in
>>>>>>> counts between the trigrams they have in common, and how many trigram 
>>>>>>> hits
>>>>>>> for each are unique?
>>>>>>> That is where using nub1 #/. nub2 would be attractive, were it not
>>>>>>> required that the arguments had the same row counts, and Key could not
>>>>>>> count unmatched rows.
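>>>>>>> 
>>>>>>> One possible 0-to-1 score, purely as an assumption since no formula
>>>>>>> has been settled here, is the sum of the smaller counts divided by the
>>>>>>> sum of the larger counts. The vectors a and b below are made-up counts
>>>>>>> already aligned to a common nub:
>>>>>>> 
>>>>>>> ```j
>>>>>>> NB. Hypothetical aligned counts: item i of a and of b refer to the
>>>>>>> NB. same trigram, with 0 where a string lacks that trigram.
>>>>>>> a=: 1 2 0 1
>>>>>>> b=: 1 0 3 1
>>>>>>> NB. Shared occurrences over total occurrences (an assumed measure,
>>>>>>> NB. not the scoring used in the original VBA program).
>>>>>>> score=: (+/ a <. b) % +/ a >. b
>>>>>>> NB. Here score is 2 % 7; identical counts give 1, disjoint counts 0.
>>>>>>> ```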
>>>>>>> 
>>>>>>> As it stands, I fear I am duplicating effort to find the nubs in 
>>>>>>> preparing
>>>>>>> the summaries, and again if I have to use i. to calculate the scores. 
>>>>>>> If I
>>>>>>> get a vector result when I use key on vectors, might I expect a table
>>>>>>> result (including the counts and the nub) when key is applied to tables?
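>>>>>>> 
>>>>>>> For reference, a sketch of the i. route for lining up two summaries
>>>>>>> that were prepared separately. All names (smry, n1, c1, n2, c2, u, a1,
>>>>>>> a2) are hypothetical and the strings are made up:
>>>>>>> 
>>>>>>> ```j
>>>>>>> NB. Summary of one string: (nub of trigrams) ; (counts), as a boxed pair.
>>>>>>> smry=: (~. ; #/.~)@(3&(,\))
>>>>>>> 'n1 c1'=: smry 'ggtaaaatgactg'
>>>>>>> 'n2 c2'=: smry 'ctgattaaggttc'
>>>>>>> NB. Common nub: unique rows of the two nubs taken together.
>>>>>>> u=: ~. n1 , n2
>>>>>>> NB. n1 i. u is #n1 for rows of u missing from n1, which selects
>>>>>>> NB. the 0 appended to c1; likewise for n2 and c2.
>>>>>>> a1=: (n1 i. u) { c1 , 0
>>>>>>> a2=: (n2 i. u) { c2 , 0
>>>>>>> NB. a1 and a2 now have one count per row of u, 0 where unmatched.
>>>>>>> ```
>>>>>>> 
>>>>>>> Only the two small nubs are searched here, not the original texts, so
>>>>>>> the trigram tables do not have to be rebuilt.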
>>>>>>> 
>>>>>>> Or is there a more appropriate approach? (In Access and VBA, I used
>>>>>>> dictionary objects with 3 character keys, as I recall. But I was very
>>>>>>> pleasantly surprised at how well the 3 character trigrams recognized 
>>>>>>> text
>>>>>>> similarities.)
>>>>>>> 
>>>>>>> I really appreciate any insights you might have, Ric, and thanks for
>>>>>>> tolerating my ignorance.
>>>>>>> 
>>>>>>>>> On Oct 11, 2019, at 10:23 PM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Not sure I'm understanding your questions. Maybe including some of the
>>>>>>>> expressions you've tried to illustrate your points would help?
>>>>>>> 
>>>>>>> ----------------------------------------------------------------------
>>>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>>>>> 
