Re: [Jprogramming] finding matching sets

'Pascal Jasmin' via Programming Sat, 26 Jul 2014 16:44:24 -0700

I'm very curious as to how it applies to Joe's data.  Seems hard to find other 
people's benchmarks on it.  Joe did mention that the problem leaned towards a 
hashing approach.


I can say that symbols are supposed to help with string matching including 
memory issues, and that they provide greater improvement the more repetitive 
your strings are.  This latter point is why the context of stock quotes is 
provided as the canonical example, and I understand to be a core basis for the 
speed edge of systems like kdb in that domain.


I too had hoped there was a guideline for the amount of repetetivity that 
justifies using symbols.  For this problem it may be very little repetition, 
since for all we know the only purpose of the data is to perform this single 
search.  I would guess there to be a significant speed improvement on the 
search part, even with little repetition, but have no idea if that improvement 
is countered by the cost of doing the s: transformation.


Another approach, similar but likely slower than symbols would be to build a 
data structure with one row per key, but with the right column holding a list 
of indexes into a string or symbol table, as in:  

'c1' ; 0 1
'c2' ; 0 1
'c3' ; 0 1 2

RE Boss's solution makes that as first step

If you are not necessarily interested in the specific match pairings but rather 
want a match count, =/~ may be a good next function:

   =/~ ({."1 </.}."1) ( ({. (, <)  [: s: <@:(1&{::  , ' ' , [: ": 2&{::))) "1 t
or
   =/~ 0 1; 0 1 ; 0 1 2 
1 1 0 
1 1 0 
0 0 1

time for a benchmark:

first, use boxed indices as suggested above:

s0 =. <@:/:~"1 ?10000 10$ 20

   timespacex '([: +/ =/)~ s0' NB. counts the number of other identical 
datasets for each key
6.39809 1.34369e8

with symbols of same (actually similar shaped random ...) data

s1 =. s:@:<@:":@:/:~"1 ?10000 10$ 20
   timespacex '([: +/ =/)~ s1' 
0.215074 1.34369e8 

the actual repetition is fairly low for that dataset, (8 I think)

   */ ([: +/ =/)~ s1 
16384 






----- Original Message -----
From: robert therriault <[email protected]>
To: [email protected]
Cc: 
Sent: Saturday, July 26, 2014 5:34:27 PM
Subject: Re: [Jprogramming] finding matching sets

To quote Joey Tuttle from the Jconference...benchmarks. :-)

There are a lot of things that go into the decision to use symbols and in the 
end you test and decide if it provides the performance you want.

Cheers, bob

On Jul 26, 2014, at 1:21 PM, R.E. Boss <[email protected]> wrote:

> What information is available about difference in  performance between using 
> symbols and indices?
> 
> 
> R.E. Boss
> 
> (Add your info to http://www.jsoftware.com/jwiki/Community/Demographics )
> 
> 
> 
> 
>> -----Original Message-----
>> From: [email protected] [mailto:programming-
>> [email protected]] On Behalf Of 'Pascal Jasmin' via
>> Programming
>> Sent: zaterdag 26 juli 2014 17:20
>> To: [email protected]
>> Subject: Re: [Jprogramming] finding matching sets
>> 
>> If your non key columns are very repetitive, you might find symbols
>> worthwhile:
>> 
>> ({. (, <) [: s: }.)"1 t
>> ┌──┬─────────┐
>> │c1│`p1 `0.25│
>> ├──┼─────────┤
>> │c1│`p2 `0.35│
>> ├──┼─────────┤
>> │c2│`p1 `0.25│
>> ├──┼─────────┤
>> │c2│`p2 `0.35│
>> ├──┼─────────┤
>> │c3│`p1 `0.25│
>> ├──┼─────────┤
>> │c3│`p2 `0.35│
>> ├──┼─────────┤
>> │c3│`p3 `0.45│
>> └──┴─────────┘
>> 
>> this alternative makes the non key columns into one symbol which should
>> make comparisons faster for just your requirement.
>> 
>>   ({. (, <)  [: s: <@:(1&{::  , ' ' , [: ": 2&{::))"1 t
>> ┌──┬────────┐
>> │c1│`p1 0.25│
>> ├──┼────────┤
>> │c1│`p2 0.35│
>> ├──┼────────┤
>> │c2│`p1 0.25│
>> ├──┼────────┤
>> │c2│`p2 0.35│
>> ├──┼────────┤
>> │c3│`p1 0.25│
>> ├──┼────────┤
>> │c3│`p2 0.35│
>> ├──┼────────┤
>> │c3│`p3 0.45│
>> └──┴────────┘
>> 
>> sticking the bossman's function ahead of it
>> 
>> (([: ~. {."1) </.~ [: (i.~ ~.)({."1 </.}."1)) symboled=. ( ({. (, <)  [: s: 
>> <@:(1&{::  , '
>> ' , [: ": 2&{::))) "1 t
>> ┌───────┬────┐
>> │┌──┬──┐│┌──┐│
>> ││c1│c2│││c3││
>> │└──┴──┘│└──┘│
>> └───────┴────┘
>> 
>> 
>> ----- Original Message -----
>> From: R.E. Boss <[email protected]>
>> To: [email protected]
>> Cc:
>> Sent: Saturday, July 26, 2014 6:40:45 AM
>> Subject: Re: [Jprogramming] finding matching sets
>> 
>> 
>>   (~.{."1 t)</.~ (i.~ ~.)({."1 </.}."1) t
>> +-------+----+
>> |+--+--+|+--+|
>> ||c1|c2|||c3||
>> |+--+--+|+--+|
>> +-------+----+
>> 
>> If that costs too much memory, try
>> 
>>   [T=:(i.~ ~.)"1 &.|:t
>> 0 0 0
>> 0 1 1
>> 1 0 0
>> 1 1 1
>> 2 0 0
>> 2 1 1
>> 2 2 2
>> 
>>   (~.{."1 t)</.~ (i.~ ~.)({."1 </.}."1) T
>> +-------+----+
>> |+--+--+|+--+|
>> ||c1|c2|||c3||
>> |+--+--+|+--+|
>> +-------+----+
>> 
>> 
>> R.E. Boss
>> 
>> (Add your info to
>> http://www.jsoftware.com/jwiki/Community/Demographics )
>> 
>> 
>> 
>>> -----Original Message-----
>>> From: [email protected]
>> [mailto:programming-
>>> [email protected]] On Behalf Of Joe Bogner
>>> Sent: vrijdag 25 juli 2014 13:08
>>> To: [email protected]
>>> Subject: [Jprogramming] finding matching sets
>>> 
>>> Given the following data:
>>> 
>>> t =: ;: ;._2 (0 : 0)
>>> c1 p1 0.25
>>> c1 p2 0.35
>>> c2 p1 0.25
>>> c2 p2 0.35
>>> c3 p1 0.25
>>> c3 p2 0.35
>>> c3 p3 0.45
>>> )
>>> 
>>> 
>>> c1 has two rows (p1 0.25) and (p2 0.35)
>>> c2 has two rows (p1 0.25) and (p2 0.35)
>>> c3 has three rows (p1 025), (p2 0.35), (p3 0.45)
>>> 
>>> How can I identify that c1 and c2 have the same set of values and that c3
>>> is different?
>>> 
>>> I'd like to run the algorithm on a 1.6M row table
>>> 
>>> I created a prototype in javascript using a rough approach, but I haven't
>>> translated it to J in case there is a better way:
>>> 
>>> 1. Sort array by column 2 (product)
>>> 2. Loop through the array and create a hash table of the concatenated
>>> product/value pair (e.g: p2 0.35)  for each customer
>>> 3. Loop through the hash table and create a list of customers for each
>>> unique string of product/value pairs
>>> 
>>> var t = function(){/*
>>> c1 p1 0.25
>>> c1 p2 0.35
>>> c2 p1 0.25
>>> c2 p2 0.35
>>> c3 p1 0.25
>>> c3 p2 0.35
>>> c3 p3 0.45
>>> */}.toString().slice(15,-4).split('\n').map(function(x) { return x.split('
>>> ') })
>>> t = t.sort(function(x,y) { return x[1]>y[1] })
>>> 
>>> var cs = t.reduce(function(memo,val) { memo[val[0]] =
>>> (memo[val[0]]||'')+val[1]+val[2]; return memo;}, {});
>>> 
>>> //JSON.stringify(cs)
>>> //"{"c1":"p10.25p20.35","c2":"p10.25p20.35","c3":"p10.25p20.35p30.45"}"
>>> 
>>> var matches = Object.keys(cs).reduce(function(memo,val) { var key =
>>> memo[cs[val]] = (memo[cs[val]] || []); key.push(val);  return memo;}, {})
>>> 
>>> JSON.stringify(matches)
>>> 
>>> "{"p10.25p20.35":["c1","c2"],"p10.25p20.35p30.45":["c3"]}"
>>> 
>>> How should this problem be approached in J?
>>> ----------------------------------------------------------------------
>>> For information about J forums see
>> http://www.jsoftware.com/forums.htm



>> 
>> 
>> 
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>> 
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] finding matching sets

Reply via email to