Here's one approach... I find it much easier to work with if there is actual data. The following may not be representative of your data but it gives us somewhere to start.

   ]'X Y'=: 'actg' {~ 2 30 ?@$ 4
ggtaaaatgactgtagtgaagaaggagtcc
ctgattaaggttcggtgtcgataccgcgca

We now have 2 strings X and Y. Let's obtain the trigrams for each string:

   trig=: 3,\&.> X;Y

Get the nub of the union of both sets of trigrams and prepend it to each trigram set:

   supertrig=: (,~&.> <@~.@;) trig

Now we can use Key to count the trigrams in each set and decrement by 1 (for the extra copy that we added):

   <: #/.~&> supertrig
1 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1

Or to summarise by trigram:

   (~.@; trig);|: <: #/.~&> supertrig
+---+---+
|ggt|1 2|
|gta|2 0|
|taa|1 1|
|aaa|2 0|
|aat|1 0|
|atg|1 0|
|tga|2 1|
|gac|1 0|
|act|1 0|
|ctg|1 1|
|tgt|1 1|
|tag|1 0|
|agt|2 0|
|gtg|1 1|
|gaa|2 0|
|aag|2 1|
|aga|1 0|
|agg|1 1|
|gga|1 0|
|gag|1 0|
|gtc|1 1|
|tcc|1 0|
|gat|0 2|
|att|0 1|
|tta|0 1|
|gtt|0 1|
|ttc|0 1|
|tcg|0 2|
|cgg|0 1|
|cga|0 1|
|ata|0 1|
|tac|0 1|
|acc|0 1|
|ccg|0 1|
|cgc|0 2|
|gcg|0 1|
|gca|0 1|
+---+---+

On Sat, Oct 12, 2019 at 4:40 PM 'Jim Russell' via Programming <programm...@jsoftware.com> wrote:

> Sure, thanks. I'm working to re-implement a text comparison program I did
> using VBA & Microsoft Access a number of years back.
>
> The object is to compare two text documents and see how similar one is to
> the other by comparing the number of unique trigrams that are found in
> each.
> For each text string a table of trigrams is constructed with the
> expression 3,\x. The resulting table of 3-character samples m is then
> tallied using #/.~m . This yields a vector of counts of each unique trigram
> corresponding to (an unseen) nub of m. The count, and a copy of the nub of
> m, represent a summary of the text in string x.
> This same process then repeated to creat a smry for the second string, y.
>
> The next step in the process is to assign a score of 0 to 1 based on a
> comparison of the two string summaries. It would seem sensible to compare
> the nub of the two text strings to each other. What is the difference in
> counts between the trigrams they have in common, and how many trigram hits
> for each are unique?
> That is where using nub1 #/. nub2 would be attractive, were it not
> required that the arguments had the same row counts, and Key could not
> count unmatched rows.
>
> As it stands, I fear I am duplicating effort to find the nubs in preparing
> the summaries, and again if I have to use i. to calculate the scores. If I
> get a vector result when I use key on vectors, might I expect a table
> result (including the counts and the nub) when key is applied to tables?
>
> Or is there a more appropriate approach? (In access and VBA, I used
> dictionary objects with 3 character keys, as I recall. But I was very
> pleasantly surprised at how well the 3 character trigrams recognized text
> similarities.)
>
> I really appreciate any insights you might have, Ric, and thanks for
> tolerating my ignorance.
>
>
> On Oct 11, 2019, at 10:23 PM, Ric Sherlock <tikk...@gmail.com> wrote:
> >
> > Not sure I'm understanding your questions. Maybe including some of the
> > expressions you've tried to illustrate your points would help?
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm