Sorry, I wasn’t considering trigrams in my off-the-cuff stuff.

Mike
Sent from my iPad

> On 12 Oct 2019, at 18:38, 'Jim Russell' via Programming
> <programm...@jsoftware.com> wrote:
>
> Thanks. (Except for the part about etaionshrdlu... and the fact that I was
> unaware of any difference between prepend and prefix.) I had considered texts
> using different character sets, and figured I would be comfortable reporting
> them as completely different.
>
> Thanks for the insights!
>
> (I should credit Skip Cave for first mentioning trigrams in this form long
> ago.)
>
>> On Oct 12, 2019, at 12:59 PM, 'Mike Day' via Programming
>> <programm...@jsoftware.com> wrote:
>>
>> The advantage is that you’re in control of the domain of interest. You
>> know all cases that _might_ arise.
>>
>> With my example of a given alphabet, we might be looking at the frequency of
>> letters in English prose, and find their sort order is along the lines of
>> etaionshrdlu... (iirc).
>>
>> Not so good if you come across kanji, say, assuming it’s something that’s
>> foreign to you, and you don’t know the alphabet/symbol set. Then you can
>> only work on the symbols, cases, that you encounter.
>>
>> Even then, though, you can compare two sets by using the nub of their union
>> as the basis for frequency analysis. In that case, I would prepend (not
>> prefix, pardon my slip earlier) that nub to each series, and, once again,
>> decrement all counts for each series by one.
>>
>> Any clearer?
>>
>> Cheers,
>>
>> Mike
>>
>> Sent from my iPad
>>
>>> On 12 Oct 2019, at 16:26, 'Jim Russell' via Programming
>>> <programm...@jsoftware.com> wrote:
>>>
>>> Thanks, interesting. What advantage does that give you?
>>>
>>>>> On Oct 12, 2019, at 9:02 AM, 'Mike Day' via Programming
>>>>> <programm...@jsoftware.com> wrote:
>>>>
>>>> A quick thought, might not be what you have in mind.
>>>>
>>>> If, say, you’re seeking the frequency of letters, it’s worth prefixing
>>>> the sorted alphabet of interest to your string and then subtracting one
>>>> from the scores.
>>>>
>>>> Useful for me sometimes, anyway.
>>>>
>>>> Mike
>>>>
>>>> Sent from my iPad
>>>>
>>>>> On 12 Oct 2019, at 06:50, 'Jim Russell' via Programming
>>>>> <programm...@jsoftware.com> wrote:
>>>>>
>>>>> Looks promising. Typically, the strings are different lengths, and we may
>>>>> not have access to them at the same time. (Which is why I had the
>>>>> intermediate summary step.) Let me ponder that (I don't think it will
>>>>> matter) while I study your approach more. Thanks very much!
>>>>>
>>>>>>> On Oct 12, 2019, at 1:22 AM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>>
>>>>>> Here's one approach...
>>>>>>
>>>>>> I find it much easier to work with if there is actual data. The following
>>>>>> may not be representative of your data but it gives us somewhere to
>>>>>> start.
>>>>>>
>>>>>>    ]'X Y'=: 'actg' {~ 2 30 ?@$ 4
>>>>>> ggtaaaatgactgtagtgaagaaggagtcc
>>>>>> ctgattaaggttcggtgtcgataccgcgca
>>>>>>
>>>>>> We now have 2 strings X and Y. Let's obtain the trigrams for each string.
>>>>>>
>>>>>>    trig=: 3,\&.> X;Y
>>>>>>
>>>>>> Get the nub of the union of both sets of trigrams and prepend it to each
>>>>>> trigram set.
>>>>>>
>>>>>>    supertrig=: (,~&.> <@~.@;) trig
>>>>>>
>>>>>> Now we can use Key to count the trigrams in each set and decrement by 1
>>>>>> (for the extra copy that we added).
>>>>>>
>>>>>>    <: #/.~&> supertrig
>>>>>> 1 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>>>> 2 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1
>>>>>>
>>>>>> Or to summarise by trigram:
>>>>>>
>>>>>>    (~.@; trig);|: <: #/.~&> supertrig
>>>>>> +---+---+
>>>>>> |ggt|1 2|
>>>>>> |gta|2 0|
>>>>>> |taa|1 1|
>>>>>> |aaa|2 0|
>>>>>> |aat|1 0|
>>>>>> |atg|1 0|
>>>>>> |tga|2 1|
>>>>>> |gac|1 0|
>>>>>> |act|1 0|
>>>>>> |ctg|1 1|
>>>>>> |tgt|1 1|
>>>>>> |tag|1 0|
>>>>>> |agt|2 0|
>>>>>> |gtg|1 1|
>>>>>> |gaa|2 0|
>>>>>> |aag|2 1|
>>>>>> |aga|1 0|
>>>>>> |agg|1 1|
>>>>>> |gga|1 0|
>>>>>> |gag|1 0|
>>>>>> |gtc|1 1|
>>>>>> |tcc|1 0|
>>>>>> |gat|0 2|
>>>>>> |att|0 1|
>>>>>> |tta|0 1|
>>>>>> |gtt|0 1|
>>>>>> |ttc|0 1|
>>>>>> |tcg|0 2|
>>>>>> |cgg|0 1|
>>>>>> |cga|0 1|
>>>>>> |ata|0 1|
>>>>>> |tac|0 1|
>>>>>> |acc|0 1|
>>>>>> |ccg|0 1|
>>>>>> |cgc|0 2|
>>>>>> |gcg|0 1|
>>>>>> |gca|0 1|
>>>>>> +---+---+
>>>>>>
>>>>>>> On Sat, Oct 12, 2019 at 4:40 PM 'Jim Russell' via Programming <
>>>>>>> programm...@jsoftware.com> wrote:
>>>>>>>
>>>>>>> Sure, thanks. I'm working to re-implement a text comparison program I did
>>>>>>> using VBA & Microsoft Access a number of years back.
>>>>>>>
>>>>>>> The object is to compare two text documents and see how similar one is to
>>>>>>> the other by comparing the number of unique trigrams that are found in
>>>>>>> each.
>>>>>>> For each text string a table of trigrams is constructed with the
>>>>>>> expression 3,\x. The resulting table of 3-character samples m is then
>>>>>>> tallied using #/.~m . This yields a vector of counts of each unique
>>>>>>> trigram corresponding to (an unseen) nub of m.
>>>>>>> The count, and a copy of the nub of m, represent a summary of the text
>>>>>>> in string x.
>>>>>>> This same process is then repeated to create a summary for the second
>>>>>>> string, y.
>>>>>>>
>>>>>>> The next step in the process is to assign a score of 0 to 1 based on a
>>>>>>> comparison of the two string summaries. It would seem sensible to compare
>>>>>>> the nub of the two text strings to each other. What is the difference in
>>>>>>> counts between the trigrams they have in common, and how many trigram hits
>>>>>>> for each are unique?
>>>>>>> That is where using nub1 #/. nub2 would be attractive, were it not that
>>>>>>> the arguments must have the same number of rows and that Key cannot
>>>>>>> count unmatched rows.
>>>>>>>
>>>>>>> As it stands, I fear I am duplicating effort to find the nubs in preparing
>>>>>>> the summaries, and again if I have to use i. to calculate the scores.
>>>>>>> If I get a vector result when I use Key on vectors, might I expect a table
>>>>>>> result (including the counts and the nub) when Key is applied to tables?
>>>>>>>
>>>>>>> Or is there a more appropriate approach? (In Access and VBA, I used
>>>>>>> dictionary objects with 3-character keys, as I recall. But I was very
>>>>>>> pleasantly surprised at how well the 3-character trigrams recognized text
>>>>>>> similarities.)
>>>>>>>
>>>>>>> I really appreciate any insights you might have, Ric, and thanks for
>>>>>>> tolerating my ignorance.
>>>>>>>
>>>>>>>>> On Oct 11, 2019, at 10:23 PM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Not sure I'm understanding your questions. Maybe including some of the
>>>>>>>> expressions you've tried to illustrate your points would help?
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
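
A minimal sketch, not taken from any of the posts above, of how the ideas in this thread might fit together when each text is summarised on its own (Jim's case, where the two strings may not be available at the same time). It reuses the two sample strings from Ric's message and looks trigrams up with i. rather than prepending the nub. The names summary, align, basis, counts and score are hypothetical, and the score (shared counts divided by combined counts) is only one assumed way to get a number between 0 and 1; nobody in the thread proposed it.

```j
NB. Sketch only: summary, align, basis, counts and score are hypothetical names.
X=: 'ggtaaaatgactgtagtgaagaaggagtcc'   NB. sample strings copied from Ric's message
Y=: 'ctgattaaggttcggtgtcgataccgcgca'

NB. Summarise one text on its own: boxed pair (nub of trigrams ; counts).
summary=: 3 : '(~. ; #/.~) 3 ,\ y'

NB. Counts held in summary y, re-expressed on the trigram basis x.
NB. Trigrams missing from the summary index the appended 0.
align=: 4 : 0
'nub cnt'=. y
(nub i. x) { cnt , 0
)

sx=: summary X
sy=: summary Y

basis=: ~. (> {. sx) , > {. sy               NB. nub of the union of both nubs
counts=: (basis align sx) ,: basis align sy  NB. one row per text, one column per trigram

NB. Assumed score, not from the thread: sum of column minima over sum of
NB. column maxima; 1 for identical trigram profiles, 0 for disjoint ones.
score=: (+/ <./ counts) % +/ >./ counts
```

On these two strings, counts should reproduce the two rows of Ric's result, and score works out to 9 % 47, roughly 0.19. Because each summary is built separately, only the small nub and count lists need to be kept for a text, not the text itself.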