Hi everybody -

I've formed a Kaggle team to tackle the Higgs Boson problem discussed
earlier on these forums.  There are still two slots open if anyone else
would like to join my team - first come, first served.  I myself may be
somewhat busy with other things over the next few weeks, as I have a trip
planned to England and I'll be going to the J conference on July 24-25.
Presumably I'll have finished crafting a talk by then as well.

If anyone wishes to form a competing team, please do.  In the interests of
promoting a good J turn-out, I'm making available the trivial code I've
put together so far for the basic grunt work of reading in the input
files and making a file to submit (below).

Good luck!

Devon

-- 
Devon McCormick, CFA
---
require 'tables/dsv'

NB.* getTrainingData: break apart .csv file of training data into useful
NB. arrays.
getTrainingData=: 3 : 0
   'trntit trn'=. split (',';'') readdsv y   NB. Split title row from data
   trn=. }:"1 trn [ lbls=. ;_1{"1 trn   NB. Separate character labels from numbers
   trn=. }."1 trn [ evid=. 0{"1 trn     NB. Pull off event IDs as vec
   trn=. ".&>trn                        NB. Data as simple numeric mat
   trn=. }:"1 trn [ wts=. _1{"1 trn     NB. Pull off weights column as vec
   trntit;lbls;wts;evid;<trn
NB.EG 'trntit lbls wts evid trn'=. getTrainingData 'Data/training.csv'
)

NB.* getTestData: read .csv file of test data into useful arrays.
getTestData=: 3 : 0
   'tsttit tst'=. split (',';'') readdsv y   NB. Split title row from data
   tst=. }."1 tst [ evidt=. 0{"1 tst         NB. Pull off event IDs as vec
   tst=. ".&>tst                             NB. Test data as simple numeric mat
   tsttit;evidt;<tst
NB.EG 'tsttit evidt tst'=. getTestData 'Data/test.csv'
)

NB.* calcAMS: calculate AMS metric based on R code in AMSscore.R.
calcAMS=: 3 : 0
   'wts actual guesses'=. y  NB. Weights vector, actual values, predicted values.
NB. Sum signal weights according to predicted label 's' or 'b'.
   's b'=. guesses +//. wts*actual='s'
   's b'=. ('s'~:{.guesses)|.s;b        NB. Correct order from guess order
   br=. 10  NB. Regularization term to reduce variance of measure.
   %:+:s-~(s+b+br)*^.>:s%b+br
)
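
NB.* calcAMS2: a hedged sketch of an alternative, not the original code.
NB. It assumes the challenge's definition of the metric, in which s is the
NB. total weight of true 's' events guessed 's' and b is the total weight of
NB. true 'b' events guessed 's'.  The name calcAMS2 is hypothetical.
NB. In conventional notation: AMS = sqrt(2*((s+b+br)*ln(1+s/(b+br)) - s)).
calcAMS2=: 3 : 0
   'wts actual guesses'=. y       NB. Same argument layout as calcAMS
   sel=. guesses='s'              NB. Events guessed as signal
   s=. +/wts*sel*.actual='s'      NB. Signal weight among guessed signals
   b=. +/wts*sel*.actual='b'      NB. Background weight among guessed signals
   br=. 10                        NB. Same regularization term as calcAMS
   %:+:s-~(s+b+br)*^.>:s%b+br
NB.EG calcAMS2 1 2 3 4;'sbsb';'ssbb'
)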

NB.** Example of using the above code to create a submission file:
NB.   1!:44 pp=. {path to code}
   'trntit lbls wts evid trn'=. getTrainingData 'Data/training.csv'
   'tsttit evidt tst'=. getTestData 'Data/test.csv'

NB. Initial attempt: simple regression.
   coeffs=. (lbls='s')%.trn        NB. Regress s=1 on factors
   ests=. trn+/ . * coeffs         NB. Estimates based on regression

   's'+/ . = lbls                  NB. Number of signals
85667
   lbls #/. lbls                   NB. # 's' vs. 'b'
85667 164333
   ('s'={.lbls)|.lbls #/. lbls     NB. Put 'b' # first
164333 85667

   threshold=. (/:~ests){~-'s'+/ . = lbls  NB. Guess threshold for correct # of each signal
   (ests>:threshold) #/. ests NB. Verify correct number 'b' vs. 's'
164333 85667

   guess1=. 'bs'{~ests>:threshold  NB. Estimates to labels: 's' signal, 'b' background.
   calcAMS wts;lbls;guess1         NB. Measure of goodness is 19.8468 (higher is better)
19.8468

NB. Build submission file.
   submission=. ('EventId';'RankOrder';'Class'),(":&.>350000+i.#ests),.(":&.>>:/:/:ests),.<"0 guess1
   submission writedsv 'Data/NEJedi0.csv';',';''
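
NB. A hedged sketch, not the original method: the submission line above is
NB. built from the training estimates.  Assuming the file should instead
NB. cover the test events read by getTestData, one might apply the same
NB. coefficients and threshold to tst.  The names testests, guesst and
NB. submission2, and the file name NEJedi0test.csv, are hypothetical.
   testests=. tst+/ . * coeffs          NB. Estimates for the test events
   guesst=. 'bs'{~testests>:threshold   NB. Reuse the training threshold
   submission2=. ('EventId';'RankOrder';'Class'),evidt,.(":&.>>:/:/:testests),.<"0 guesst
   submission2 writedsv 'Data/NEJedi0test.csv';',';''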

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm