I should add that most of the speed gain that comes from the symbol implementation is based on comparing unboxed data. If I box the symbols, then there is only a 15% or so gain relative to integer lists.
timespacex '([: +/ =/)~ ;/ s1' 5.46913 1.35863e8 comparing boxes of boxes hurts even more: s2 =. ;/ each s0 NB. box inter timespacex'([: +/ =/)~ s2' 37.4153 1.34369e8 So one nice use of symbols is that you could use unboxed data without space fills for the comparisons, and you can convert any boxed data into a unique symbol by using the lr (5!:5) of the (boxed) rows. ----- Original Message ----- From: 'Pascal Jasmin' via Programming <[email protected]> To: "[email protected]" <[email protected]> Cc: Sent: Saturday, July 26, 2014 7:43:39 PM Subject: Re: [Jprogramming] finding matching sets I'm very curious as to how it applies to Joe's data. Seems hard to find other people's benchmarks on it. Joe did mention that the problem leaned towards a hashing approach. I can say that symbols are supposed to help with string matching including memory issues, and that they provide greater improvement the more repetitive your strings are. This latter point is why the context of stock quotes is provided as the canonical example, and I understand to be a core basis for the speed edge of systems like kdb in that domain. I too had hoped there was a guideline for the amount of repetetivity that justifies using symbols. For this problem it may be very little repetition, since for all we know the only purpose of the data is to perform this single search. I would guess there to be a significant speed improvement on the search part, even with little repetition, but have no idea if that improvement is countered by the cost of doing the s: transformation. Another approach, similar but likely slower than symbols would be to build a data structure with one row per key, but with the right column holding a list of indexes into a string or symbol table, as in: 'c1' ; 0 1 'c2' ; 0 1 'c3' ; 0 1 2 RE Boss's solution makes that as first step If you are not necessarily interested in the specific match pairings but rather want a match count, =/~ may be a good next function: =/~ ({."1 </.}."1) ( ({. (, <) [: s: <@:(1&{:: , ' ' , [: ": 2&{::))) "1 t or =/~ 0 1; 0 1 ; 0 1 2 1 1 0 1 1 0 0 0 1 time for a benchmark: first, use boxed indices as suggested above: s0 =. <@:/:~"1 ?10000 10$ 20 timespacex '([: +/ =/)~ s0' NB. counts the number of other identical datasets for each key 6.39809 1.34369e8 with symbols of same (actually similar shaped random ...) data s1 =. s:@:<@:":@:/:~"1 ?10000 10$ 20 timespacex '([: +/ =/)~ s1' 0.215074 1.34369e8 the actual repetition is fairly low for that dataset, (8 I think) */ ([: +/ =/)~ s1 16384 ----- Original Message ----- From: robert therriault <[email protected]> To: [email protected] Cc: Sent: Saturday, July 26, 2014 5:34:27 PM Subject: Re: [Jprogramming] finding matching sets To quote Joey Tuttle from the Jconference...benchmarks. :-) There are a lot of things that go into the decision to use symbols and in the end you test and decide if it provides the performance you want. Cheers, bob On Jul 26, 2014, at 1:21 PM, R.E. Boss <[email protected]> wrote: > What information is available about difference in performance between using > symbols and indices? > > > R.E. Boss > > (Add your info to http://www.jsoftware.com/jwiki/Community/Demographics ) > > > > >> -----Original Message----- >> From: [email protected] [mailto:programming- >> [email protected]] On Behalf Of 'Pascal Jasmin' via >> Programming >> Sent: zaterdag 26 juli 2014 17:20 >> To: [email protected] >> Subject: Re: [Jprogramming] finding matching sets >> >> If your non key columns are very repetitive, you might find symbols >> worthwhile: >> >> ({. (, <) [: s: }.)"1 t >> ┌──┬─────────┐ >> │c1│`p1 `0.25│ >> ├──┼─────────┤ >> │c1│`p2 `0.35│ >> ├──┼─────────┤ >> │c2│`p1 `0.25│ >> ├──┼─────────┤ >> │c2│`p2 `0.35│ >> ├──┼─────────┤ >> │c3│`p1 `0.25│ >> ├──┼─────────┤ >> │c3│`p2 `0.35│ >> ├──┼─────────┤ >> │c3│`p3 `0.45│ >> └──┴─────────┘ >> >> this alternative makes the non key columns into one symbol which should >> make comparisons faster for just your requirement. >> >> ({. (, <) [: s: <@:(1&{:: , ' ' , [: ": 2&{::))"1 t >> ┌──┬────────┐ >> │c1│`p1 0.25│ >> ├──┼────────┤ >> │c1│`p2 0.35│ >> ├──┼────────┤ >> │c2│`p1 0.25│ >> ├──┼────────┤ >> │c2│`p2 0.35│ >> ├──┼────────┤ >> │c3│`p1 0.25│ >> ├──┼────────┤ >> │c3│`p2 0.35│ >> ├──┼────────┤ >> │c3│`p3 0.45│ >> └──┴────────┘ >> >> sticking the bossman's function ahead of it >> >> (([: ~. {."1) </.~ [: (i.~ ~.)({."1 </.}."1)) symboled=. ( ({. (, <) [: s: >> <@:(1&{:: , ' >> ' , [: ": 2&{::))) "1 t >> ┌───────┬────┐ >> │┌──┬──┐│┌──┐│ >> ││c1│c2│││c3││ >> │└──┴──┘│└──┘│ >> └───────┴────┘ >> >> >> ----- Original Message ----- >> From: R.E. Boss <[email protected]> >> To: [email protected] >> Cc: >> Sent: Saturday, July 26, 2014 6:40:45 AM >> Subject: Re: [Jprogramming] finding matching sets >> >> >> (~.{."1 t)</.~ (i.~ ~.)({."1 </.}."1) t >> +-------+----+ >> |+--+--+|+--+| >> ||c1|c2|||c3|| >> |+--+--+|+--+| >> +-------+----+ >> >> If that costs too much memory, try >> >> [T=:(i.~ ~.)"1 &.|:t >> 0 0 0 >> 0 1 1 >> 1 0 0 >> 1 1 1 >> 2 0 0 >> 2 1 1 >> 2 2 2 >> >> (~.{."1 t)</.~ (i.~ ~.)({."1 </.}."1) T >> +-------+----+ >> |+--+--+|+--+| >> ||c1|c2|||c3|| >> |+--+--+|+--+| >> +-------+----+ >> >> >> R.E. Boss >> >> (Add your info to >> http://www.jsoftware.com/jwiki/Community/Demographics ) >> >> >> >>> -----Original Message----- >>> From: [email protected] >> [mailto:programming- >>> [email protected]] On Behalf Of Joe Bogner >>> Sent: vrijdag 25 juli 2014 13:08 >>> To: [email protected] >>> Subject: [Jprogramming] finding matching sets >>> >>> Given the following data: >>> >>> t =: ;: ;._2 (0 : 0) >>> c1 p1 0.25 >>> c1 p2 0.35 >>> c2 p1 0.25 >>> c2 p2 0.35 >>> c3 p1 0.25 >>> c3 p2 0.35 >>> c3 p3 0.45 >>> ) >>> >>> >>> c1 has two rows (p1 0.25) and (p2 0.35) >>> c2 has two rows (p1 0.25) and (p2 0.35) >>> c3 has three rows (p1 025), (p2 0.35), (p3 0.45) >>> >>> How can I identify that c1 and c2 have the same set of values and that c3 >>> is different? >>> >>> I'd like to run the algorithm on a 1.6M row table >>> >>> I created a prototype in javascript using a rough approach, but I haven't >>> translated it to J in case there is a better way: >>> >>> 1. Sort array by column 2 (product) >>> 2. Loop through the array and create a hash table of the concatenated >>> product/value pair (e.g: p2 0.35) for each customer >>> 3. Loop through the hash table and create a list of customers for each >>> unique string of product/value pairs >>> >>> var t = function(){/* >>> c1 p1 0.25 >>> c1 p2 0.35 >>> c2 p1 0.25 >>> c2 p2 0.35 >>> c3 p1 0.25 >>> c3 p2 0.35 >>> c3 p3 0.45 >>> */}.toString().slice(15,-4).split('\n').map(function(x) { return x.split(' >>> ') }) >>> t = t.sort(function(x,y) { return x[1]>y[1] }) >>> >>> var cs = t.reduce(function(memo,val) { memo[val[0]] = >>> (memo[val[0]]||'')+val[1]+val[2]; return memo;}, {}); >>> >>> //JSON.stringify(cs) >>> //"{"c1":"p10.25p20.35","c2":"p10.25p20.35","c3":"p10.25p20.35p30.45"}" >>> >>> var matches = Object.keys(cs).reduce(function(memo,val) { var key = >>> memo[cs[val]] = (memo[cs[val]] || []); key.push(val); return memo;}, {}) >>> >>> JSON.stringify(matches) >>> >>> "{"p10.25p20.35":["c1","c2"],"p10.25p20.35p30.45":["c3"]}" >>> >>> How should this problem be approached in J? >>> ---------------------------------------------------------------------- >>> For information about J forums see >> http://www.jsoftware.com/forums.htm >> >> >> >> ---------------------------------------------------------------------- >> For information about J forums see http://www.jsoftware.com/forums.htm >> >> ---------------------------------------------------------------------- >> For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
