Thank you! I don't know why I stumbled past that in the first place.
I think I got tripped up by a mistake in an earlier experiment and
then by recalling Ganesh's comment: "Because the data is boxed, all
comparisons must be boxed."

Adding a few more comments in case it helps someone in the future...
Taking a step back, I suspect this version is slow because it has to
open up each box and do the comparison one item at a time [1]:
(6!:2) '(3 : ''mids i. <y'') each sids'
18.0614

This version, by contrast, presumably does it as a single vector
operation, or triggers some other handling such as looking up indices
from a hash [2]:

(6!:2) 'mids i. sids'
0.00921206

Again, the simpler and clearer version wins.

   ('a';'b';'c';'q') i. ('a';'a';'b';'b';'a';'c')
0 0 1 1 0 2

Thanks again



[1] - example Javascript implementation

var sids = ['a','a','b','b','a','c'];
var mids = ['a','b','c','q'];
var ret = [];
for (var i = 0; i < sids.length; i++) {
  var result = mids.length;  // not found: J's i. returns #mids, not #sids
  for (var k = 0; k < mids.length; k++) {
    if (sids[i] === mids[k]) {
      result = k;
      break;
    }
  }
  ret.push(result);
}
console.log(ret);

[0, 0, 1, 1, 0, 2]


[2] - hashed lookup

I'm not sure what a vector operation would look like or even if it's
possible. I suspect the performance is lifted by either eliminating
the 2nd loop above or by breaking sooner or getting smarter about the
starting points (e.g. if it internally sorts). I spent some time
looking at vi.c and couldn't figure it out.

var sids = ['a','a','b','b','a','c'];
var mids = ['a','b','c','q'];
var midsi = {'a':0,'b':1,'c':2,'q':3};  // key -> index in mids
var ret = [];
for (var i = 0; i < sids.length; i++) {
  var result = mids.length;  // not found: J's i. returns #mids, not #sids
  var lookup = midsi[sids[i]];
  if (lookup !== undefined)
      result = lookup;
  ret.push(result);
}
console.log(ret);
[0, 0, 1, 1, 0, 2]
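
The midsi table above is typed in by hand. Here is a minimal sketch of
building it from mids instead, assuming a JavaScript Map and a
hypothetical helper name indexOfEach (neither comes from J or from the
fiddle). The first occurrence of a duplicated key wins, and a missing
key gets mids.length, which is what J's dyadic i. reports for items it
cannot find:

```javascript
// Hypothetical helper, not part of J: mimics `mids i. sids` for arrays of strings.
function indexOfEach(mids, sids) {
  // Build the key -> first index table once, up front.
  const firstIndex = new Map();
  mids.forEach((key, k) => {
    if (!firstIndex.has(key)) firstIndex.set(key, k); // keep the first occurrence
  });
  // One hash lookup per item; missing items get mids.length.
  return sids.map((key) => (firstIndex.has(key) ? firstIndex.get(key) : mids.length));
}

// Same data as above, so this logs the same indices: 0 0 1 1 0 2.
console.log(indexOfEach(['a', 'b', 'c', 'q'], ['a', 'a', 'b', 'b', 'a', 'c']));
```

The table is built once and each lookup is then a single probe, so the
inner loop from [1] disappears. This is only my guess at the kind of
thing i. might be doing, not a description of the J source.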

[3] - Both implementations can be played with here: http://jsfiddle.net/Gzvjk/1/
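
On the "internally sorts" guess in [2]: a rough sketch of that idea,
assuming a hypothetical helper sortedIndexOfEach, is to sort the
positions of mids once and then binary-search each item of sids. This
only illustrates the technique and is not a claim about what J does
internally.

```javascript
// Hypothetical helper: sort-then-binary-search lookup, not J's actual implementation.
function sortedIndexOfEach(mids, sids) {
  // Positions of mids ordered by key; ties keep original order so the
  // first occurrence of a duplicated key sorts first.
  const order = mids.map((_, k) => k).sort((a, b) =>
    mids[a] < mids[b] ? -1 : mids[a] > mids[b] ? 1 : a - b);
  return sids.map((key) => {
    // Find the leftmost sorted position whose key is >= the one we want.
    let lo = 0, hi = order.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (mids[order[mid]] < key) lo = mid + 1; else hi = mid;
    }
    // Map back to the original position, or mids.length when not found.
    return lo < order.length && mids[order[lo]] === key ? order[lo] : mids.length;
  });
}

console.log(sortedIndexOfEach(['a', 'b', 'c', 'q'], ['a', 'a', 'b', 'b', 'a', 'c']));
```

Each lookup takes roughly log2(mids.length) comparisons instead of a
full scan, at the cost of one sort up front.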

On Wed, Oct 9, 2013 at 1:21 AM, Ric Sherlock <[email protected]> wrote:
> Joe, you suggest that every player should only appear in the master table
> once. If that is the case I think you should be able to simply use the
> primitive i. .
>    m1=: mids i. sids
>
> This gives the same answer as your
>    m=:(3 : 'search y') each Ssids
>
> except that m1 isn't boxed. i.e.
>    m1 -: ; m
>
>     20 (6!:2) 'mids i. sids'
>
> 0.00967119
>
> The only reason that these give a different answer to your idx verb is
> because there are two entries in the master table for the playerID
> <'baezda01'. This appears to be an error in the data because they are in
> fact different players.
>
>     0 1 4 16 17 {"1 mcols, 458 459 { mdata
>
> ┌────────┬────────┬─────────┬─────────┬────────┐
> │lahmanID│playerID│birthYear│nameFirst│nameLast│
> ├────────┼────────┼─────────┼─────────┼────────┤
> │459     │baezda01│1977     │Danys    │Baez    │
> ├────────┼────────┼─────────┼─────────┼────────┤
> │460     │baezda01│1953     │Jose     │Baez    │
> └────────┴────────┴─────────┴─────────┴────────┘
>
>
> HTH
>
>
>
> On Wed, Oct 9, 2013 at 3:34 PM, Joe Bogner <[email protected]> wrote:
>
>> Thanks everyone and thank you for the feedback Dan. Yes, I've been at it a
>> little over a month. I'm working with it nearly every day for a few hours
>> most days. I spend probably 50% of my days in R, Excel and SQL so I'm
>> always looking for a new trick or new approach even if it's not a vastly
>> superior approach.
>>
>> I have one more example to share that rounds out the slicing and dicing
>> theme. Like many things the first time I try them with J, I thought it
>> would take 15 minutes and ended up turning into several hours.
>>
>> Earlier I defined an idx verb: idx=:13 : '(; > L:0 I. each <"1 (x =/ y))',
>> which Pascal and Ric offered improvements to.
>>
>> I was excited to use it to join tables. It worked very well on small
>> tables. A common task in my daily data analysis is merging tables
>> together, and the baseball example was a perfect test.
>>
>>
>> idx=:13 : '(; > L:0 I. each <"1 (x =/ y))'
>>
>>
>> 'mcols mdata'=. split readcsv '/temp/Master.csv'
>>
>> 'scols sdata'=. split readcsv '/temp/Salaries.csv'
>>
>>
>> mids=. (mcols i. <'playerID') {"1 mdata
>>
>> sids=. (scols i. <'playerID') {"1 sdata
>>
>>
>> NB. Ouch, way too slow. It ends up building a 23141 18125 array
>>
>> NB. (6!:2) '(sids idx mids)'
>>
>> NB. 54.3933
>>
>>
>> NB. Also too slow.
>>
>> NB. (6!:2) '(3 : ''mids I. @:= <y'') each sids'
>>
>> NB. 43.321
>>
>>
>> NB. Getting better. I don't need every index in the master since a player
>> NB. should only be there once. Still too slow.
>> NB. (6!:2) '(3 : ''mids i. <y'') each sids'
>>
>> NB. 18.0614
>>
>>
>> NB. Try a fast search http://www.jsoftware.com/help/release/midot.htm
>>
>> NB.  search =: mids&i.
>>
>> NB. (6!:2) ']m=.(3 : ''search <y'') each sids'
>>
>> NB. 9.76146
>>
>>
>> NB. Can we do better by sorting?
>>
>> NB. sorttbl=: ]/:{"1 NB. http://www.jsoftware.com/jwiki/JPhrases/Sorting
>>
>> NB. midxtbl =. 1 sorttbl (> (3 : 'y;(y{"1 mids)' ) each i. # mids)
>>
>> NB. sidxtbl =. 1 sorttbl (> (3 : 'y;(y{"1 sids)' ) each i. # sids)
>>
>>
>> NB. sortedMids=. 1{"1 midxtbl
>>
>> NB. sortedSids=. 1{"1 sidxtbl
>>
>>
>> NB. That was a lot of work for 2 seconds, not to mention another level of
>> NB. indirection
>>
>> NB. 6!:2 'm=.(3 : ''search <y'') each sortedSids'
>>
>> NB. 7.2621
>>
>> This is where I went off the rails. I started to look into hash tables and
>> sparse arrays. Lesson re-learned -- if a language is older than X years,
>> then it's likely it doesn't need what you think it does to solve the
>> problem.
>>
>> Manually implementing hash tables is no fun and stealing the hash key from
>> a symbol table isn't very fast either.
>>
>> That reminded me of symbols though, which I had experimented with already.
>>
>> Final solution - use symbols: Boxed strings are slow for comparison. I knew
>> that, but it didn't occur to me.
>>
>>
>>
>> Smids=. s:mids
>>
>> Ssids=. s:sids
>>
>> search =: Smids&i.
>>
>>
>>
>> NB. I had to run it a few times to make sure I wasn't missing something
>>
>> NB. BLAZING FAST
>>
>> 6!:2 'm=.(3 : ''search y'') each Ssids'
>>
>> NB. 0.031005
>>
>>
>> Now that we have our merge index, we can do some awesome things
>>
>>
>>
>> msd=. (m { mdata),.sdata
>>
>> msc=. mcols,scols
>>
>>
>> playerSalaries=.(('playerID';'nameFirst';'yearID';'salary') idx msc) {"1 msd
>>
>> hankOrTommy=.(('Hank';'Tommy') idx ((msc i.<'nameFirst') {"1 msd)) { msd
>>
>> vert =: |:@:(<"_1&>)
>>
>>    sumit=:13 : '(~.x);(i.~x) +/ /. y'
>>
>> ]vert (> (msc i.<'nameFirst'){"1 hankOrTommy) sumit (".>((msc i.<'salary'){"1 hankOrTommy))
>>
>> ┌─────┬────────┐
>> │Hank │22366500│
>> ├─────┼────────┤
>> │Tommy│12724010│
>> └─────┴────────┘
>>
>>
>>
>> NB. Sum by Name and Year
>>
>> yearkey=: > ((msc i.<'nameFirst'){"1 hankOrTommy) , each ((msc i.<'yearID'){"1 hankOrTommy)
>>
>> 3 {. vert yearkey sumit (".>((msc i.<'salary'){"1 hankOrTommy))
>>
>>
>> ┌─────────┬──────┐
>> │Hank2002 │200000│
>> ├─────────┼──────┤
>> │Hank2003 │302500│
>> ├─────────┼──────┤
>> │Hank2004 │550000│
>> └─────────┴──────┘
>>
>>
>>
>> Sidenote: symbols are great for keeping memory down too. By default each
>> string is a separate copy in memory. Symbols keep only a single copy and
>> reference a hash table, like R's string table:
>> http://cran.r-project.org/doc/manuals/R-ints.html#The-CHARSXP-cache. You
>> can use symbols instead of strings when reading from files if you
>> read them manually using cut, fread, etc. I'll write that up some time.
>>
>>
>> On Tue, Oct 8, 2013 at 9:38 AM, Pascal Jasmin <[email protected]> wrote:
>>
>> > The only hacky part is that you have included a list inside your verb.
>> >
>> > If you just wanted indices:
>> >
>> > findI =. (I.leaf @:<@:E.)"0 1
>> >
>> > ('a';'b';'q') findI  ('a';'b';'z';'c';'a';'q')
>> > ┌───┬─┬─┐
>> > │0 4│1│5│
>> > └───┴─┴─┘
>> >
>> > if that seems noisy to you, consider:
>> >    ('a';'b';'q') (I.)"0 1  ('a';'b';'z';'c';'a';'q')
>> > 0 1 1 1 0 1
>> > 1 0 1 1 1 1
>> > 1 1 1 1 1 0
>> >
>> > E. is basically -.@:I, here. if you haven't used it before.
>> >
>> >
>> > ----- Original Message -----
>> >
>> > Is there a better way to find indices of a subset within a greater list?
>> > This is my hacky solution
>> >
>> >
>> > find=.('a';'b';'q')
>> >
>> > list=.('a';'b';'z';'c';'a';'q')
>> >
>> >
>> >   ] (; > L:0 (3 : '(<y) ([: I. =) list' ) each find) { list
>> >
>> > ┌─┬─┬─┬─┐
>> > │a│a│b│q│
>> > └─┴─┴─┴─┘
>> >
>> >
>> > find xxx list
>> > ----------------------------------------------------------------------
>> > For information about J forums see http://www.jsoftware.com/forums.htm