Thanks, interesting. What advantage does that give you?

> On Oct 12, 2019, at 9:02 AM, 'Mike Day' via Programming
> <programm...@jsoftware.com> wrote:
>
> A quick thought, might not be what you have in mind.
>
> If, say, you’re seeking the frequency of letters, it’s worth prefixing the
> sorted alphabet of interest to your string and then subtracting one from the
> scores.
>
> Useful for me sometimes, anyway.
>
> Mike
>
> Sent from my iPad
>
>> On 12 Oct 2019, at 06:50, 'Jim Russell' via Programming
>> <programm...@jsoftware.com> wrote:
>>
>> Looks promising. Typically, the strings are different lengths, and we may
>> not have access to them at the same time. (Which is why I had the
>> intermediate summary step.) Let me ponder that (I don't think it will
>> matter) while I study your approach more. Thanks very much!
>>
>>> On Oct 12, 2019, at 1:22 AM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>
>>> Here's one approach...
>>>
>>> I find it much easier to work with if there is actual data. The following
>>> may not be representative of your data but it gives us somewhere to start.
>>>
>>>    ]'X Y'=: 'actg' {~ 2 30 ?@$ 4
>>> ggtaaaatgactgtagtgaagaaggagtcc
>>> ctgattaaggttcggtgtcgataccgcgca
>>>
>>> We now have 2 strings X and Y. Let's obtain the trigrams for each string:
>>>
>>>    trig=: 3,\&.> X;Y
>>>
>>> Get the nub of the union of both sets of trigrams and prepend it to each
>>> trigram set.
>>>
>>>    supertrig=: (,~&.> <@~.@;) trig
>>>
>>> Now we can use Key to count the trigrams in each set and decrement by 1
>>> (for the extra copy that we added).
>>>
>>>    <: #/.~&> supertrig
>>> 1 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>> 2 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1
>>>
>>> Or to summarise by trigram:
>>>
>>>    (~.@; trig);|: <: #/.~&> supertrig
>>> +---+---+
>>> |ggt|1 2|
>>> |gta|2 0|
>>> |taa|1 1|
>>> |aaa|2 0|
>>> |aat|1 0|
>>> |atg|1 0|
>>> |tga|2 1|
>>> |gac|1 0|
>>> |act|1 0|
>>> |ctg|1 1|
>>> |tgt|1 1|
>>> |tag|1 0|
>>> |agt|2 0|
>>> |gtg|1 1|
>>> |gaa|2 0|
>>> |aag|2 1|
>>> |aga|1 0|
>>> |agg|1 1|
>>> |gga|1 0|
>>> |gag|1 0|
>>> |gtc|1 1|
>>> |tcc|1 0|
>>> |gat|0 2|
>>> |att|0 1|
>>> |tta|0 1|
>>> |gtt|0 1|
>>> |ttc|0 1|
>>> |tcg|0 2|
>>> |cgg|0 1|
>>> |cga|0 1|
>>> |ata|0 1|
>>> |tac|0 1|
>>> |acc|0 1|
>>> |ccg|0 1|
>>> |cgc|0 2|
>>> |gcg|0 1|
>>> |gca|0 1|
>>> +---+---+
>>>
>>>> On Sat, Oct 12, 2019 at 4:40 PM 'Jim Russell' via Programming
>>>> <programm...@jsoftware.com> wrote:
>>>>
>>>> Sure, thanks. I'm working to re-implement a text comparison program I did
>>>> using VBA & Microsoft Access a number of years back.
>>>>
>>>> The object is to compare two text documents and see how similar one is to
>>>> the other by comparing the number of unique trigrams that are found in
>>>> each.
>>>> For each text string a table of trigrams is constructed with the
>>>> expression 3,\x. The resulting table of 3-character samples m is then
>>>> tallied using #/.~m. This yields a vector of counts of each unique trigram
>>>> corresponding to (an unseen) nub of m. The count, and a copy of the nub of
>>>> m, represent a summary of the text in string x.
>>>> This same process is then repeated to create a summary for the second
>>>> string, y.
>>>>
>>>> The next step in the process is to assign a score of 0 to 1 based on a
>>>> comparison of the two string summaries. It would seem sensible to compare
>>>> the nub of the two text strings to each other. What is the difference in
>>>> counts between the trigrams they have in common, and how many trigram hits
>>>> for each are unique?
>>>> That is where using nub1 #/. nub2 would be attractive, were it not
>>>> required that the arguments had the same row counts, and Key could not
>>>> count unmatched rows.
>>>>
>>>> As it stands, I fear I am duplicating effort to find the nubs in preparing
>>>> the summaries, and again if I have to use i. to calculate the scores. If I
>>>> get a vector result when I use Key on vectors, might I expect a table
>>>> result (including the counts and the nub) when Key is applied to tables?
>>>>
>>>> Or is there a more appropriate approach? (In Access and VBA, I used
>>>> dictionary objects with 3-character keys, as I recall. But I was very
>>>> pleasantly surprised at how well the 3-character trigrams recognized text
>>>> similarities.)
>>>>
>>>> I really appreciate any insights you might have, Ric, and thanks for
>>>> tolerating my ignorance.
>>>>
>>>>> On Oct 11, 2019, at 10:23 PM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>
>>>>> Not sure I'm understanding your questions. Maybe including some of the
>>>>> expressions you've tried to illustrate your points would help?
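
The thread stops before the scoring step Jim describes (a score from 0 to 1 based on the two summaries). The following is only a sketch of one way it might be done, reusing Ric's trig and supertrig definitions. The names counts and score, and the choice of measure (per-trigram minimum count summed, divided by per-trigram maximum count summed), are assumptions for illustration and were not proposed by anyone in the thread. X and Y are typed in as literal copies of the sample strings shown in Ric's message so that the session does not depend on the random draw.

```j
NB. Literal copies of the two sample strings from Ric's message
X =: 'ggtaaaatgactgtagtgaagaaggagtcc'
Y =: 'ctgattaaggttcggtgtcgataccgcgca'

trig      =: 3 ,\&.> X;Y            NB. boxed trigram tables, one per string
supertrig =: (,~&.> <@~.@;) trig    NB. combined nub prepended to each table
counts    =: <: #/.~&> supertrig    NB. 2-row table: counts per trigram, per string

NB. score (hypothetical name): sum of per-trigram minima over sum of maxima.
NB. 1 means identical trigram counts; 0 means no trigram in common.
score =: +/@:(<./) % +/@:(>./)

score counts
```

Because the combined nub is prepended to both tables, the two rows of counts line up trigram by trigram, so the comparison needs no separate lookup with i. between the two nubs. If the two strings are not available at the same time, the same verb should apply to any two count vectors that have been aligned to a shared list of trigrams.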
---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm