Re: [Jprogramming] Slicing and dicing tables

Joe Bogner Tue, 08 Oct 2013 19:35:08 -0700

Thanks everyone and thank you for the feedback Dan. Yes, I've been at it a
little over a month. I'm working with it nearly every day for a few hours
most days. I spend probably 50% of my days in R, Excel and SQL so I'm
always looking for a new trick or new approach even if it's not a vastly
superior approach.

I have one more example to share that rounds out the slicing and dicing
theme. As with many things I try in J for the first time, I thought it
would take 15 minutes and it ended up taking several hours.

Earlier I defined an idx verb: idx=:13 : '(; > L:0 I. each <"1 (x =/ y))',
which Pascal and Ric offered improvements to.

I was excited to use it to join tables, and it worked very well on small
ones. Merging tables is a common task in my daily data analysis, so the
baseball example was a perfect test.

idx=:13 : '(; > L:0 I. each <"1 (x =/ y))'

'mcols mdata'=. split readcsv '/temp/Master.csv'

'scols sdata'=. split readcsv '/temp/Salaries.csv'

mids=. (mcols i. <'playerID') {"1 mdata

sids=. (scols i. <'playerID') {"1 sdata

NB. Ouch, way too slow. It ends up building a 23141 18125 array

NB. (6!:2) '(sids idx mids)'

NB. 54.3933

NB. Also too slow.

NB. (6!:2) '(3 : ''mids I. @:= <y'') each sids'

NB. 43.321

NB. Getting better. I don't need every index in the master since a player
NB. should only be there once. Still too slow
NB. (6!:2) '(3 : ''mids i. <y'') each sids'

NB. 18.0614

NB. Try a fast search http://www.jsoftware.com/help/release/midot.htm

NB.  search =: mids&i.

NB. (6!:2) ']m=.(3 : ''search <y'') each sids'

NB. 9.76146

NB. Can we do better by sorting?

NB. sorttbl=: ]/:{"1 NB. http://www.jsoftware.com/jwiki/JPhrases/Sorting

NB. midxtbl =. 1 sorttbl (> (3 : 'y;(y{"1 mids)' ) each i. # mids)

NB. sidxtbl =. 1 sorttbl (> (3 : 'y;(y{"1 sids)' ) each i. # sids)

NB. sortedMids=. 1{"1 midxtbl

NB. sortedSids=. 1{"1 sidxtbl

NB. That was a lot of work for 2 seconds, not to mention another level of
NB. indirection

NB. 6!:2 'm=.(3 : ''search <y'') each sortedSids'

NB. 7.2621

This is where I went off the rails. I started to look into hash tables and
sparse arrays. Lesson re-learned -- if a language is older than X years,
then it's likely it doesn't need what you think it does to solve the
problem.

Manually implementing hash tables is no fun and stealing the hash key from
a symbol table isn't very fast either.

That reminded me of symbols though, which I had experimented with already.

Final solution - use symbols: Boxed strings are slow for comparison. I knew
that, but it didn't occur to me.

Smids=. s:mids

Ssids=. s:sids

search =: Smids&i.

NB. I had to run it a few times to make sure I wasn't missing something

NB. BLAZING FAST

6!:2 'm=.(3 : ''search y'') each Ssids'

NB. 0.031005
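
To make the symbol lookup easier to follow, here is a minimal sketch on toy
data. The toy* names and the IDs are made up for illustration and are not
the real Master.csv or Salaries.csv columns. It also shows that i. takes
the whole symbol list as its right argument, so the each wrapper is
optional (I have not timed that variant against the one above).

```j
NB. Hypothetical toy IDs, not the real playerID columns.
toyMids =: 'aaronha01';'johnto01';'zitoba01'
toySids =: 'zitoba01';'aaronha01';'zitoba01';'johnto01'

NB. Convert the boxed strings to symbols once, up front.
toySmids =: s: toyMids
toySsids =: s: toySids

NB. Look up every salary-side ID in the master-side list in one call.
NB. The result is a plain (unboxed) list of row numbers.
toyM =: toySmids i. toySsids
```

On the real data the equivalent would be m=. Smids i. Ssids, which gives
unboxed indices instead of the boxed ones that each produces.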

Now that we have our merge index, we can do some awesome things:

msd=. (m { mdata),.sdata

msc=. mcols,scols

playerSalaries=.(('playerID';'nameFirst';'yearID';'salary') idx msc) {"1 msd

hankOrTommy=.(('Hank';'Tommy') idx ((msc i.<'nameFirst') {"1 msd)) { msd
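
The merge step (m { mdata),.sdata is easier to see on a tiny pair of
tables. This is only a sketch: the toy* names, the IDs and the column
layout are assumptions for illustration, not the real baseball files.

```j
NB. Hypothetical toy tables; playerID is column 0 in both.
toyMcols =: 'playerID';'nameFirst'
toyMdata =: 2 2 $ 'p1';'Hank';'p2';'Tommy'
toyScols =: 'playerID';'yearID';'salary'
toySdata =: 3 3 $ 'p2';'2001';'500';'p1';'2002';'200';'p2';'2002';'700'

NB. For each salary row, the row number of the matching master row.
toyIdx =: (s: 0 {"1 toyMdata) i. s: 0 {"1 toySdata

NB. Pull the matching master rows and stitch them onto the salary rows.
toyJoined =: (toyIdx { toyMdata) ,. toySdata
toyCols   =: toyMcols , toyScols
```

One caveat: i. returns the length of the master list for an ID it cannot
find, so an unmatched salary row would make { fail with an index error.
The sketch assumes every ID on the salary side exists in the master.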

vert =: |:@:(<"_1&>)

   sumit=:13 : '(~.x);(i.~x) +/ /. y'

]vert (> (msc i.<'nameFirst'){"1 hankOrTommy) sumit (".>((msc i.<'salary'){"1 hankOrTommy))

┌─────┬────────┐
│Hank │22366500│
├─────┼────────┤
│Tommy│12724010│
└─────┴────────┘
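
For anyone following along, here is a small sketch of what sumit does,
using made-up keys and values (the toy* names are hypothetical). It
groups the values on the right by the keys on the left and sums each
group.

```j
sumit =: 13 : '(~.x);(i.~x) +/ /. y'   NB. as defined above

NB. Hypothetical toy keys and values.
toyNames =: > 'Hank';'Tommy';'Hank'    NB. 3-row character matrix, padded
toyPay   =: 100 200 300

NB. Result is a two-box list: the unique keys, then one sum per key.
toyTotals =: toyNames sumit toyPay
```

As far as I can tell, key (/.) already groups by the items of its left
argument, so toyNames +/ /. toyPay should give the same sums without the
i.~ step.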

NB. Sum by Name and Year

yearkey=: > ((msc i.<'nameFirst'){"1 hankOrTommy) , each ((msc i.<'yearID'){"1 hankOrTommy)

3 {. vert yearkey sumit (".>((msc i.<'salary'){"1 hankOrTommy))

┌─────────┬──────┐
│Hank2002 │200000│
├─────────┼──────┤
│Hank2003 │302500│
├─────────┼──────┤
│Hank2004 │550000│
└─────────┴──────┘

Sidenote: symbols are great for keeping memory down too. By default each
string is a separate copy in memory. Symbols keep only a single copy and
reference a hash table, like R's string cache:
http://cran.r-project.org/doc/manuals/R-ints.html#The-CHARSXP-cache. You
can use symbols instead of strings when reading from files if you read
them manually using cut, fread, etc. I'll write that up some time.

On Tue, Oct 8, 2013 at 9:38 AM, Pascal Jasmin <[email protected]> wrote:

> The only hacky part is that you have included a list inside your verb.
>
> If you just wanted indices:
>
> findI =. (I.leaf @:<@:E.)"0 1
>
> ('a';'b';'q') findI  ('a';'b';'z';'c';'a';'q')
> ┌───┬─┬─┐
> │0 4│1│5│
> └───┴─┴─┘
>
> if that seems noisy to you, consider:
>    ('a';'b';'q') (I.)"0 1  ('a';'b';'z';'c';'a';'q')
> 0 1 1 1 0 1
> 1 0 1 1 1 1
> 1 1 1 1 1 0
>
> E. is basically -.@:I, here. if you haven't used it before.
>
>
> ----- Original Message -----
>
> Is there a better way to find indices of a subset within a greater list?
> This is my hacky solution
>
>
> find=.('a';'b';'q')
>
> list=.('a';'b';'z';'c';'a';'q')
>
>
>   ] (; > L:0 (3 : '(<y) ([: I. =) list' ) each find) { list
>
> ┌─┬─┬─┬─┐
> │a│a│b│q│
> └─┴─┴─┴─┘
>
>
> find xxx list
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
