Here's one approach...

I find it much easier to work with actual data. The following may not be
representative of your data, but it gives us somewhere to start.

   ]'X Y'=: 'actg' {~ 2 30 ?@$ 4
ggtaaaatgactgtagtgaagaaggagtcc
ctgattaaggttcggtgtcgataccgcgca


We now have 2 strings X and Y. Let's obtain the trigrams for each string:

   trig=: 3,\&.> X;Y

Get the nub of the union of both sets of trigrams and prepend it to each
trigram set:

   supertrig=: (,~&.> <@~.@;) trig

Now we can use Key to count the trigrams in each set and decrement by 1 (for
the extra copy that we added):

   <: #/.~&> supertrig
1 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1
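
To see why the extra copy of the nub matters, here is a toy illustration (the
strings 'abcd' and 'aab' are made up, not taken from your data). Key only
counts items that are actually present, so a trigram missing from one string
would get no entry at all. Prepending the nub guarantees that every entry
appears at least once, and the decrement removes that extra occurrence.

```j
   #/.~ 'aab'               NB. only the items present are counted
2 1
   <: #/.~ 'abcd' , 'aab'   NB. prepend the nub 'abcd', count, then subtract 1
2 1 0 0
```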

Or to summarise by trigram:

   (~.@; trig);|: <: #/.~&> supertrig
+---+---+
|ggt|1 2|
|gta|2 0|
|taa|1 1|
|aaa|2 0|
|aat|1 0|
|atg|1 0|
|tga|2 1|
|gac|1 0|
|act|1 0|
|ctg|1 1|
|tgt|1 1|
|tag|1 0|
|agt|2 0|
|gtg|1 1|
|gaa|2 0|
|aag|2 1|
|aga|1 0|
|agg|1 1|
|gga|1 0|
|gag|1 0|
|gtc|1 1|
|tcc|1 0|
|gat|0 2|
|att|0 1|
|tta|0 1|
|gtt|0 1|
|ttc|0 1|
|tcg|0 2|
|cgg|0 1|
|cga|0 1|
|ata|0 1|
|tac|0 1|
|acc|0 1|
|ccg|0 1|
|cgc|0 2|
|gcg|0 1|
|gca|0 1|
+---+---+
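
If a single score between 0 and 1 is wanted from these counts, one possibility
is sketched below. The formula (shared occurrences divided by the occurrences
in the multiset union) is my assumption, not something taken from your original
VBA program, and the names counts, shared and total are only illustrative.

```j
   counts=: <: #/.~&> supertrig   NB. row 0: counts for X, row 1: counts for Y
   shared=: +/ <./ counts         NB. occurrences common to both (smaller count per trigram)
   total=: +/ >./ counts          NB. occurrences in the union (larger count per trigram)
   shared , total
9 47
   shared % total                 NB. 1 = identical trigram profiles, 0 = nothing shared
0.191489
```

Other weightings, such as comparing only which trigrams occur rather than how
often, could be computed from the same counts table.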


On Sat, Oct 12, 2019 at 4:40 PM 'Jim Russell' via Programming <
programm...@jsoftware.com> wrote:

> Sure, thanks. I'm working to re-implement a text comparison program I did
> using VBA & Microsoft Access a number of years back.
>
> The object is to compare two text documents and see how similar one is to
> the other by comparing the number of unique trigrams that are found in
> each.
> For each text string a table of trigrams is constructed with the
> expression 3,\x. The resulting table of 3-character samples m is then
> tallied using #/.~m . This yields a vector of counts of each unique trigram
> corresponding to (an unseen) nub of m. The count, and a copy of the nub of
> m, represent a summary of the text in string x.
> This same process is then repeated to create a summary for the second
> string, y.
>
> The next step in the process is to assign a score of 0 to 1 based on a
> comparison of the two string summaries. It would seem sensible to compare
> the nub of the two text strings to each other. What is the difference in
> counts between the trigrams they have in common, and how many trigram hits
> for each are unique?
> That is where using nub1 #/. nub2 would be attractive, were it not
> required that the arguments had the same row counts, and Key could not
> count unmatched rows.
>
> As it stands, I fear I am duplicating effort to find the nubs in preparing
> the summaries, and again if I have to use i. to calculate the scores. If I
> get a vector result when I use key on vectors, might I expect a table
> result (including the counts and the nub) when key is applied to tables?
>
> Or is there a more appropriate approach? (In Access and VBA, I used
> dictionary objects with 3 character keys, as I recall. But I was very
> pleasantly surprised at how well the 3 character trigrams recognized text
> similarities.)
>
> I really appreciate any insights you might have, Ric, and thanks for
> tolerating my ignorance.
>
> > On Oct 11, 2019, at 10:23 PM, Ric Sherlock <tikk...@gmail.com> wrote:
> >
> > Not sure I'm understanding your questions. Maybe including some of the
> > expressions you've tried to illustrate your points would help?
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>