Thanks. (Except for the part about etaionshrdlu... and the fact that I was unaware of any difference between prepend and prefix.) I had considered texts using different character sets, and figured I would be comfortable reporting them as completely different.
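
For what it's worth, "completely different" falls out naturally once the counts for both texts are aligned on the nub of the union, as in Ric's table below: texts with no trigram in common score 0. A minimal sketch, assuming a 2-row count table with one row per text; the name wjac and the choice of a min/max (weighted Jaccard) ratio are my own illustrative assumptions, not something settled in this thread:

```j
NB. Sketch only: wjac is a hypothetical name, not from the thread.
NB. c is a 2-row table of trigram counts, one row per text, with
NB. columns aligned on the nub of the union of both trigram sets.
NB. Score = (sum of column minima) % (sum of column maxima).
wjac =: +/@:(<./) % +/@:(>./)

wjac 2 4 $ 1 2 0 0  0 0 3 1   NB. no trigram in common: 0
wjac 2 3 $ 1 2 1  1 2 1       NB. identical counts: 1
```

Any other ratio of shared to total counts would slot into the same place; this one just has the convenient property of staying between 0 and 1.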
Thanks for the insights! (I should credit Skip Cave for first mentioning trigrams in this form long ago.)

> On Oct 12, 2019, at 12:59 PM, 'Mike Day' via Programming
> <programm...@jsoftware.com> wrote:
>
> The advantage is that you’re in control of the domain of interest. You know
> all cases that _might_ arise.
>
> With my example of a given alphabet, we might be looking at the frequency of
> letters in English prose, and find their sort order is along the lines of
> etaionshrdlu... (iirc).
>
> Not so good if you come across kanji, say, assuming it’s something that’s
> foreign to you, and you don’t know the alphabet/symbol set. Then you can
> only work on the symbols, cases, that you encounter.
>
> Even then, though, you can compare two sets by using the nub of their union
> as the basis for frequency analysis. In that case, I would prepend (not
> prefix, pardon my slip earlier) that nub to each series, and, once again,
> decrement all counts for each series by one.
>
> Any clearer?
>
> Cheers,
>
> Mike
>
> Sent from my iPad
>
>> On 12 Oct 2019, at 16:26, 'Jim Russell' via Programming
>> <programm...@jsoftware.com> wrote:
>>
>> Thanks, interesting. What advantage does that give you?
>>
>>>> On Oct 12, 2019, at 9:02 AM, 'Mike Day' via Programming
>>>> <programm...@jsoftware.com> wrote:
>>>
>>> A quick thought, might not be what you have in mind.
>>>
>>> If, say, you’re seeking the frequency of letters, it’s worth prefixing the
>>> sorted alphabet of interest to your string and then subtracting one from
>>> the scores.
>>>
>>> Useful for me sometimes, anyway.
>>>
>>> Mike
>>>
>>> Sent from my iPad
>>>
>>>> On 12 Oct 2019, at 06:50, 'Jim Russell' via Programming
>>>> <programm...@jsoftware.com> wrote:
>>>>
>>>> Looks promising. Typically, the strings are different lengths, and we may
>>>> not have access to them at the same time. (Which is why I had the
>>>> intermediate summary step.)
>>>> Let me ponder that (I don't think it will
>>>> matter) while I study your approach more. Thanks very much!
>>>>
>>>>>> On Oct 12, 2019, at 1:22 AM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>
>>>>> Here's one approach...
>>>>>
>>>>> I find it much easier to work with if there is actual data. The following
>>>>> may not be representative of your data but it gives us somewhere to start.
>>>>>
>>>>>    ]'X Y'=: 'actg' {~ 2 30 ?@$ 4
>>>>> ggtaaaatgactgtagtgaagaaggagtcc
>>>>> ctgattaaggttcggtgtcgataccgcgca
>>>>>
>>>>> We now have 2 strings X and Y. Let's obtain the trigrams for each string
>>>>>
>>>>>    trig=: 3,\&.> X;Y
>>>>>
>>>>> Get the nub of the union of both sets of trigrams and prepend it to each
>>>>> trigram set.
>>>>>
>>>>>    supertrig=: (,~&.> <@~.@;) trig
>>>>>
>>>>> Now we can use Key to count the trigrams in each set and decrement by 1
>>>>> (for the extra copy that we added).
>>>>>
>>>>>    <: #/.~&> supertrig
>>>>> 1 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>>> 2 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1
>>>>>
>>>>> Or to summarise by trigram:
>>>>>
>>>>>    (~.@; trig);|: <: #/.~&> supertrig
>>>>> +---+---+
>>>>> |ggt|1 2|
>>>>> |gta|2 0|
>>>>> |taa|1 1|
>>>>> |aaa|2 0|
>>>>> |aat|1 0|
>>>>> |atg|1 0|
>>>>> |tga|2 1|
>>>>> |gac|1 0|
>>>>> |act|1 0|
>>>>> |ctg|1 1|
>>>>> |tgt|1 1|
>>>>> |tag|1 0|
>>>>> |agt|2 0|
>>>>> |gtg|1 1|
>>>>> |gaa|2 0|
>>>>> |aag|2 1|
>>>>> |aga|1 0|
>>>>> |agg|1 1|
>>>>> |gga|1 0|
>>>>> |gag|1 0|
>>>>> |gtc|1 1|
>>>>> |tcc|1 0|
>>>>> |gat|0 2|
>>>>> |att|0 1|
>>>>> |tta|0 1|
>>>>> |gtt|0 1|
>>>>> |ttc|0 1|
>>>>> |tcg|0 2|
>>>>> |cgg|0 1|
>>>>> |cga|0 1|
>>>>> |ata|0 1|
>>>>> |tac|0 1|
>>>>> |acc|0 1|
>>>>> |ccg|0 1|
>>>>> |cgc|0 2|
>>>>> |gcg|0 1|
>>>>> |gca|0 1|
>>>>> +---+---+
>>>>>
>>>>>> On Sat, Oct 12, 2019 at 4:40 PM 'Jim Russell' via Programming <
>>>>>> programm...@jsoftware.com> wrote:
>>>>>>
>>>>>> Sure, thanks. I'm working to re-implement a text comparison program I did
>>>>>> using VBA & Microsoft Access a number of years back.
>>>>>>
>>>>>> The object is to compare two text documents and see how similar one is to
>>>>>> the other by comparing the number of unique trigrams that are found in
>>>>>> each.
>>>>>> For each text string a table of trigrams is constructed with the
>>>>>> expression 3,\x. The resulting table of 3-character samples m is then
>>>>>> tallied using #/.~m . This yields a vector of counts of each unique
>>>>>> trigram corresponding to (an unseen) nub of m. The count, and a copy of
>>>>>> the nub of m, represent a summary of the text in string x.
>>>>>> This same process is then repeated to create a summary for the second
>>>>>> string, y.
>>>>>>
>>>>>> The next step in the process is to assign a score of 0 to 1 based on a
>>>>>> comparison of the two string summaries. It would seem sensible to compare
>>>>>> the nub of the two text strings to each other. What is the difference in
>>>>>> counts between the trigrams they have in common, and how many trigram
>>>>>> hits for each are unique?
>>>>>> That is where using nub1 #/. nub2 would be attractive, were it not
>>>>>> required that the arguments had the same row counts, and Key could not
>>>>>> count unmatched rows.
>>>>>>
>>>>>> As it stands, I fear I am duplicating effort to find the nubs in
>>>>>> preparing the summaries, and again if I have to use i. to calculate the
>>>>>> scores. If I get a vector result when I use key on vectors, might I
>>>>>> expect a table result (including the counts and the nub) when key is
>>>>>> applied to tables?
>>>>>>
>>>>>> Or is there a more appropriate approach? (In Access and VBA, I used
>>>>>> dictionary objects with 3 character keys, as I recall.
>>>>>> But I was very
>>>>>> pleasantly surprised at how well the 3 character trigrams recognized text
>>>>>> similarities.)
>>>>>>
>>>>>> I really appreciate any insights you might have, Ric, and thanks for
>>>>>> tolerating my ignorance.
>>>>>>
>>>>>>>> On Oct 11, 2019, at 10:23 PM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>>>
>>>>>>> Not sure I'm understanding your questions. Maybe including some of the
>>>>>>> expressions you've tried to illustrate your points would help?
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
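
On the point raised in the thread that the two strings may not be available at the same time: a minimal sketch of keeping a per-text summary (nub and counts) and aligning two summaries later with i. rather than re-running Key on the raw text. The names smry and align are hypothetical, and this is only one possible arrangement under those assumptions, not a method anyone in the thread proposed in this form:

```j
NB. Sketch only: smry and align are hypothetical names.
NB. smry y : boxed pair (trigram nub ; counts) for string y.
smry =: (~. ; #/.~)@(3 ,\ ])

NB. x align y : x and y are summaries from smry, possibly built at
NB. different times. Result is a 2-row count table whose columns
NB. follow the nub of the union of both trigram sets.
align =: 4 : 0
  'n1 c1' =. x
  'n2 c2' =. y
  nb =. ~. n1 , n2                 NB. nub of the union of trigrams
  NB. i. returns #n1 (or #n2) for a missing trigram, which selects
  NB. the extra 0 appended to the counts.
  ((c1 , 0) {~ n1 i. nb) ,: (c2 , 0) {~ n2 i. nb
)
```

With X and Y as in Ric's example, (smry X) align smry Y should give the same two rows as <: #/.~&> supertrig, while only the summaries need to be kept around.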