The advantage is that you’re in control of the domain of interest.  You know 
all cases that _might_ arise. 

With my example of a given alphabet, we might be looking at the frequency of 
letters in English prose, and find their sort order is along the lines of 
etaoin shrdlu... (iirc).
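
A minimal sketch of that in J, assuming a lower-case Latin alphabet and a
made-up sample string (the names alphabet, text, kept and counts are just
for illustration):

```j
alphabet =: 'abcdefghijklmnopqrstuvwxyz'
text     =: 'hello world'
kept     =: (text e. alphabet) # text    NB. drop anything outside the alphabet
counts   =: <: #/.~ alphabet , kept      NB. prepend the alphabet, count, subtract one
counts
NB. 0 0 0 1 1 0 0 1 0 0 0 3 0 0 2 0 0 1 0 0 0 0 1 0 0 0
7 {. alphabet \: counts                  NB. letters by descending frequency
NB. lodehrw
```

Every letter gets a slot, in alphabet order, even when its count is zero, so
counts taken from different texts line up item for item.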

That works less well if you come across kanji, say, assuming the script is 
foreign to you and you don’t know the alphabet/symbol set.  Then you can only 
work on the symbols (cases) that you actually encounter.

Even then, though, you can compare two sets by using the nub of their union as 
the basis for frequency analysis.  In that case,  I would prepend (not prefix, 
pardon my slip earlier) that nub to each series,  and, once again, decrement 
all counts for each series by one.
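
Roughly like this, with two made-up series a and b standing in for the real
data (u and counts are my names too):

```j
a =: 'banana'
b =: 'bandana'
u =: ~. a , b                        NB. nub of the union: 'band'
counts =: <: #/.~@(u&,)&> a ; b      NB. prepend u to each series, count, subtract one
counts
NB. 1 3 2 0
NB. 1 3 2 1
```

Each row of counts is in the order of u, so the two series can be compared
column by column, and the series needn't be the same length.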

Any clearer?

Cheers,

Mike



Sent from my iPad

> On 12 Oct 2019, at 16:26, 'Jim Russell' via Programming 
> <programm...@jsoftware.com> wrote:
> 
> Thanks, interesting. What advantage does that give you?
> 
>> On Oct 12, 2019, at 9:02 AM, 'Mike Day' via Programming 
>> <programm...@jsoftware.com> wrote:
>> 
>> A quick thought,  might not be what you have in mind.
>> 
>> If, say, you’re seeking the frequency of letters,  it’s worth prefixing the 
>> sorted alphabet of interest to your string and then subtracting one from the 
>> scores.
>> 
>> Useful for me sometimes, anyway.
>> 
>> Mike
>> 
>>> On 12 Oct 2019, at 06:50, 'Jim Russell' via Programming 
>>> <programm...@jsoftware.com> wrote:
>>> 
>>> Looks promising. Typically, the strings are different lengths, and we may 
>>> not have access to them at the same time. (Which is why I had the 
>>> intermediate summary step.) Let me ponder that (I don't think it will 
>>> matter) while I study your approach more. Thanks very much!
>>> 
>>>>> On Oct 12, 2019, at 1:22 AM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>> 
>>>> Here's one approach...
>>>> 
>>>> I find it much easier to work with if there is actual data. The following
>>>> may not be representative of your data but it gives us somewhere to start.
>>>> 
>>>>    ]'X Y'=: 'actg' {~ 2 30 ?@$ 4
>>>> ggtaaaatgactgtagtgaagaaggagtcc
>>>> ctgattaaggttcggtgtcgataccgcgca
>>>> 
>>>> We now have 2 strings X and Y. Let's obtain the trigrams for each string:
>>>> 
>>>>    trig=: 3,\&.> X;Y
>>>> 
>>>> Get the nub of the union of both sets of trigrams and prepend it to each
>>>> trigram set.
>>>> 
>>>>    supertrig=: (,~&.> <@~.@;) trig
>>>> 
>>>> Now we can use Key to count the trigrams in each set and decrement by 1
>>>> (for the extra copy that we added).
>>>> 
>>>>    <: #/.~&> supertrig
>>>> 1 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>>> 2 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1
>>>> 
>>>> Or to summarise by trigram:
>>>> 
>>>>    (~.@; trig);|: <: #/.~&> supertrig
>>>> +---+---+
>>>> |ggt|1 2|
>>>> |gta|2 0|
>>>> |taa|1 1|
>>>> |aaa|2 0|
>>>> |aat|1 0|
>>>> |atg|1 0|
>>>> |tga|2 1|
>>>> |gac|1 0|
>>>> |act|1 0|
>>>> |ctg|1 1|
>>>> |tgt|1 1|
>>>> |tag|1 0|
>>>> |agt|2 0|
>>>> |gtg|1 1|
>>>> |gaa|2 0|
>>>> |aag|2 1|
>>>> |aga|1 0|
>>>> |agg|1 1|
>>>> |gga|1 0|
>>>> |gag|1 0|
>>>> |gtc|1 1|
>>>> |tcc|1 0|
>>>> |gat|0 2|
>>>> |att|0 1|
>>>> |tta|0 1|
>>>> |gtt|0 1|
>>>> |ttc|0 1|
>>>> |tcg|0 2|
>>>> |cgg|0 1|
>>>> |cga|0 1|
>>>> |ata|0 1|
>>>> |tac|0 1|
>>>> |acc|0 1|
>>>> |ccg|0 1|
>>>> |cgc|0 2|
>>>> |gcg|0 1|
>>>> |gca|0 1|
>>>> +---+---+
>>>> 
>>>> 
>>>>> On Sat, Oct 12, 2019 at 4:40 PM 'Jim Russell' via Programming <
>>>>> programm...@jsoftware.com> wrote:
>>>>> 
>>>>> Sure, thanks. I'm working to re-implement a text comparison program I did
>>>>> using VBA & Microsoft Access a number of years back.
>>>>> 
>>>>> The object is to compare two text documents and see how similar one is to
>>>>> the other by  comparing the number of unique trigrams that are found in
>>>>> each.
>>>>> For each text string a table of trigrams is constructed with the
>>>>> expression 3,\x. The resulting table of 3-character samples m is then
>>>>> tallied using #/.~m . This yields a vector of counts of each unique
>>>>> trigram corresponding to (an unseen) nub of m. The count, and a copy of
>>>>> the nub of m, represent a summary of the text in string x.
>>>>> This same process is then repeated to create a summary for the second
>>>>> string, y.
>>>>> 
>>>>> The next step in the process is to assign a score of 0 to 1 based on a
>>>>> comparison of the two string summaries. It would seem sensible to compare
>>>>> the nub of the two text strings to each other. What is the difference in
>>>>> counts between the trigrams they have in common, and how many trigram hits
>>>>> for each are unique?
>>>>> That is where using nub1 #/. nub2 would be attractive, were it not that
>>>>> Key requires its arguments to have the same row counts and cannot
>>>>> count unmatched rows.
>>>>> 
>>>>> As it stands, I fear I am duplicating effort to find the nubs in preparing
>>>>> the summaries, and again if I have to use i. to calculate the scores. If I
>>>>> get a vector result when I use key on vectors, might I expect a table
>>>>> result (including the counts and the nub) when key is applied to tables?
>>>>> 
>>>>> Or is there a more appropriate approach? (In access and VBA, I used
>>>>> dictionary objects with 3 character keys, as I recall. But I was very
>>>>> pleasantly surprised at how well the 3 character trigrams recognized text
>>>>> similarities.)
>>>>> 
>>>>> I really appreciate any insights you might have, Ric, and thanks for
>>>>> tolerating my ignorance.
>>>>> 
>>>>>>> On Oct 11, 2019, at 10:23 PM, Ric Sherlock <tikk...@gmail.com> wrote:
>>>>>> 
>>>>>> Not sure I'm understanding your questions. Maybe including some of the
>>>>>> expressions you've tried to illustrate your points would help?
>>>>> 
>>>>> ----------------------------------------------------------------------
>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>>> 