Thanks. (Except for the part about etaionshrdlu... and the fact that I was
unaware of any difference between prepend and prefix.) I had considered texts 
using different character sets, and figured I would be comfortable reporting 
them as completely different. 

Thanks for the insights!

(I should credit Skip Cave for first mentioning trigrams in this form long ago.)

> On Oct 12, 2019, at 12:59 PM, 'Mike Day' via Programming 
> <programm...@jsoftware.com> wrote:
> 
> The advantage is that you’re in control of the domain of interest.  You know 
> all cases that _might_ arise. 
> 
> With my example of a given alphabet, we might be looking at the frequency of 
> letters in English prose, and find their sort order is along the lines of 
> etaionshrdlu... (iirc).
> 
> Not so good if you come across kanji, say, assuming it’s something that’s 
> foreign to you, and you don’t know the alphabet/symbol set.  Then you can 
> only work on the symbols, cases, that you encounter.
> 
> Even then, though, you can compare two sets by using the nub of their union 
> as the basis for frequency analysis. In that case, I would prepend (not 
> prefix, pardon my slip earlier) that nub to each series, and, once again, 
> decrement all counts for each series by one.
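> 
> As a rough sketch of that idea (the names a, b and basis are only
> illustrative, and the two short strings are made up for the example), the
> nub of the union serves as a common basis for both counts:
> 
>    a=: 'mississippi'
>    b=: 'missouri'
>    ]basis=: ~. a , b
> mispour
>    <: #/.~ basis , a
> 1 4 4 2 0 0 0
>    <: #/.~ basis , b
> 1 2 2 0 1 1 1
> 
> Both results line up with basis, so a symbol missing from one series shows
> up as a zero rather than being dropped.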
> 
> Any clearer?
> 
> Cheers,
> 
> Mike
> 
>> On 12 Oct 2019, at 16:26, 'Jim Russell' via Programming 
>> <programm...@jsoftware.com> wrote:
>> 
>> Thanks, interesting. What advantage does that give you?
>> 
>>>> On Oct 12, 2019, at 9:02 AM, 'Mike Day' via Programming 
>>>> <programm...@jsoftware.com> wrote:
>>> 
>>> A quick thought, might not be what you have in mind.
>>> 
>>> If, say, you’re seeking the frequency of letters, it’s worth prefixing the 
>>> sorted alphabet of interest to your string and then subtracting one from 
>>> the scores.
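>>> 
>>> For instance, a small sketch with a made-up four-letter alphabet (alpha
>>> is just an illustrative name):
>>> 
>>>    alpha=: 'acgt'
>>>    <: #/.~ alpha , 'gattaca'
>>> 3 1 1 2
>>>    <: #/.~ alpha , 'ttaa'
>>> 2 0 0 2
>>> 
>>> The counts always come out in alphabet order, with a zero for any letter
>>> that does not occur. This assumes the string holds no characters outside
>>> alpha; anything else would add extra counts at the end.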
>>> 
>>> Useful for me sometimes, anyway.
>>> 
>>> Mike
>>> 
>>>> On 12 Oct 2019, at 06:50, 'Jim Russell' via Programming 
>>>> <programm...@jsoftware.com> wrote:
>>>> 
>>>> Looks promising. Typically, the strings are different lengths, and we may 
>>>> not have access to them at the same time. (Which is why I had the 
>>>> intermediate summary step.) Let me ponder that (I don't think it will 
>>>> matter) while I study your approach more. Thanks very much!
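>>>> 
>>>> One way this might work when the strings are summarised separately, as a
>>>> sketch only (smry, basis and the sample words are made-up names and data,
>>>> not anything from the original program): keep the nub and the counts for
>>>> each string, and line them up on a common basis later with i.
>>>> 
>>>>    smry=: (~. ; #/.~)@(3 ,\ ])    NB. boxed pair: nub of trigrams ; counts
>>>>    'n1 c1'=: smry 'banana'
>>>>    'n2 c2'=: smry 'bandana'
>>>>    basis=: ~. n1 , n2             NB. nub of the union of both nubs
>>>>    (c1 , 0) {~ n1 i. basis        NB. a trigram not in n1 picks up the 0
>>>> 1 2 1 0 0 0
>>>>    (c2 , 0) {~ n2 i. basis
>>>> 1 1 0 1 1 1
>>>> 
>>>> Only the two summaries are needed at comparison time, so the strings
>>>> themselves do not have to be available together.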
>>>> 
>>>>>> On Oct 12, 2019, at 1:22 AM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>> 
>>>>> Here's one approach...
>>>>> 
>>>>> I find it much easier to work with if there is actual data. The following
>>>>> may not be representative of your data but it gives us somewhere to start.
>>>>> 
>>>>> ]'X Y'=: 'actg' {~ 2 30 ?@$ 4
>>>>> 
>>>>> ggtaaaatgactgtagtgaagaaggagtcc
>>>>> ctgattaaggttcggtgtcgataccgcgca
>>>>> 
>>>>> 
>>>>> We now have 2 strings X and Y. Let's obtain the trigrams for each string
>>>>> 
>>>>> trig=: 3,\&.> X;Y
>>>>> 
>>>>> Get the nub of the union of both sets of trigrams and prepend it to each
>>>>> trigram set.
>>>>> 
>>>>> supertrig=: (,~&.> <@~.@;) trig
>>>>> 
>>>>> Now we can use Key to count the trigrams in each set and decrement by 1
>>>>> (for the extra copy that we added).
>>>>> 
>>>>> <: #/.~&> supertrig
>>>>> 
>>>>> 1 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>>> 2 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1
>>>>> 
>>>>> Or to summarise by trigram:
>>>>> 
>>>>> (~.@; trig);|: <: #/.~&> supertrig
>>>>> 
>>>>> +---+---+
>>>>> |ggt|1 2|
>>>>> |gta|2 0|
>>>>> |taa|1 1|
>>>>> |aaa|2 0|
>>>>> |aat|1 0|
>>>>> |atg|1 0|
>>>>> |tga|2 1|
>>>>> |gac|1 0|
>>>>> |act|1 0|
>>>>> |ctg|1 1|
>>>>> |tgt|1 1|
>>>>> |tag|1 0|
>>>>> |agt|2 0|
>>>>> |gtg|1 1|
>>>>> |gaa|2 0|
>>>>> |aag|2 1|
>>>>> |aga|1 0|
>>>>> |agg|1 1|
>>>>> |gga|1 0|
>>>>> |gag|1 0|
>>>>> |gtc|1 1|
>>>>> |tcc|1 0|
>>>>> |gat|0 2|
>>>>> |att|0 1|
>>>>> |tta|0 1|
>>>>> |gtt|0 1|
>>>>> |ttc|0 1|
>>>>> |tcg|0 2|
>>>>> |cgg|0 1|
>>>>> |cga|0 1|
>>>>> |ata|0 1|
>>>>> |tac|0 1|
>>>>> |acc|0 1|
>>>>> |ccg|0 1|
>>>>> |cgc|0 2|
>>>>> |gcg|0 1|
>>>>> |gca|0 1|
>>>>> +---+---+
>>>>> 
>>>>> 
>>>>>> On Sat, Oct 12, 2019 at 4:40 PM 'Jim Russell' via Programming <
>>>>>> programm...@jsoftware.com> wrote:
>>>>>> 
>>>>>> Sure, thanks. I'm working to re-implement a text comparison program I did
>>>>>> using VBA & Microsoft Access a number of years back.
>>>>>> 
>>>>>> The object is to compare two text documents and see how similar one is to
>>>>>> the other by comparing the number of unique trigrams that are found in
>>>>>> each.
>>>>>> For each text string a table of trigrams is constructed with the
>>>>>> expression 3,\x. The resulting table of 3-character samples m is then
>>>>>> tallied using #/.~m . This yields a vector of counts of each unique
>>>>>> trigram corresponding to (an unseen) nub of m. The count, and a copy of
>>>>>> the nub of m, represent a summary of the text in string x.
>>>>>> This same process is then repeated to create a summary for the second
>>>>>> string, y.
>>>>>> 
>>>>>> The next step in the process is to assign a score of 0 to 1 based on a
>>>>>> comparison of the two string summaries. It would seem sensible to compare
>>>>>> the nub of the two text strings to each other. What is the difference in
>>>>>> counts between the trigrams they have in common, and how many trigram
>>>>>> hits for each are unique?
>>>>>> That is where using nub1 #/. nub2 would be attractive, were it not
>>>>>> required that the arguments had the same row counts, and Key could not
>>>>>> count unmatched rows.
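>>>>>> 
>>>>>> A hedged sketch of one possible score, assuming the two count vectors
>>>>>> have already been lined up on the same trigram basis: the sum of the
>>>>>> smaller counts divided by the sum of the larger ones. This ratio and the
>>>>>> name score are illustrative choices, not what the VBA version did.
>>>>>> 
>>>>>>    score=: +/@:<. % +/@:>.
>>>>>>    1 2 0 score 1 1 1
>>>>>> 0.5
>>>>>>    2 1 score 2 1
>>>>>> 1
>>>>>>    2 0 score 0 3
>>>>>> 0
>>>>>> 
>>>>>> Identical counts give 1, texts with no trigram in common give 0, and
>>>>>> everything else falls in between.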
>>>>>> 
>>>>>> As it stands, I fear I am duplicating effort to find the nubs in
>>>>>> preparing the summaries, and again if I have to use i. to calculate the
>>>>>> scores. If I get a vector result when I use key on vectors, might I
>>>>>> expect a table result (including the counts and the nub) when key is
>>>>>> applied to tables?
>>>>>> 
>>>>>> Or is there a more appropriate approach? (In Access and VBA, I used
>>>>>> dictionary objects with 3-character keys, as I recall. But I was very
>>>>>> pleasantly surprised at how well the 3-character trigrams recognized text
>>>>>> similarities.)
>>>>>> 
>>>>>> I really appreciate any insights you might have, Ric, and thanks for
>>>>>> tolerating my ignorance.
>>>>>> 
>>>>>>>> On Oct 11, 2019, at 10:23 PM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Not sure I'm understanding your questions. Maybe including some of the
>>>>>>> expressions you've tried to illustrate your points would help?
>>>>>> 
>>>>>> ----------------------------------------------------------------------
>>>>>> For information about J forums see http://www.jsoftware.com/forums.htm