Thanks Jan-Pieter, how would I recreate your results for calculating the % correct? I'll still give it a shot on my own later. I pasted some code below to help jumpstart the reading of the array of data.
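Here's roughly what I have in mind for the scoring step, as a minimal sketch: it assumes the repo's validationsample.csv is parsed into valarr the same way arr is parsed below, with the label in the first column; the variable names and the final percentage line are mine, not from your post:

   NB. sketch only: arr (training) and valarr (validation) are
   NB. label-first numeric matrices parsed as in parsefile below
   labels  =: {."1 arr      NB. digit labels: first column
   pixels  =: }."1 arr      NB. remaining 784 pixel columns
   vlabels =: {."1 valarr
   vpixels =: }."1 valarr
   pred    =: , 1 nnClass oaa labels ; pixels ; vpixels
   100 * (+/ % #) pred = vlabels   NB. % correct on the validation set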
For anyone who's interested, here's a small comparison of different
techniques to load the data:

require 'csv'

NB. from https://raw.githubusercontent.com/c4fsharp/Dojo-Digits-Recognizer/master/Dojo/trainingsample.csv
PATH =: 'c:/users/jbogner/downloads/trainingsample.csv'

parsefile =: 3 : 0
  training_file =: fread PATH
  header_end =: >: training_file i. LF
  arr =: ". ];._2 header_end }. training_file
)

parsecsv =: 3 : 0
  arr =: ". each }. readcsv PATH
)

Note 'timing of fread'
   timespacex 'parsefile'''''
3.94344 7.48989e7
   $ arr
5000 785
)

Note 'timing of parsecsv'
   timespacex 'parsecsv'''''
25.6972 4.01359e9
)

fread was the way to go vs parsecsv, since parsecsv needed to convert
each boxed cell of the 5000x785 matrix to a number. Since each line is a
comma-delimited list of numbers, ". can read each line and parse it
without boxing:

   ". '1,2,3,4'
1 2 3 4

On Tue, Jun 10, 2014 at 7:06 PM, Raul Miller <[email protected]> wrote:
> If you are only comparing distances (and not measuring them), you can skip
> the square root and just compare the magnitudes of the sums of the squares.
>
> (Generally speaking, eliminating unnecessary operations is how you make
> things faster - this can be taken to the point of silliness (code golf) but
> if you run timings on representative data sets, that can help build up your
> understanding about where time gets spent.)
>
> That said, I've not taken enough time to study this problem to understand
> the data. (For me, understanding the data is usually harder than
> understanding the code. And, even when the code needs effort to understand,
> I need to understand the data before I can even think about trying to
> understand the code.)
>
> Thanks,
>
> --
> Raul
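Raul's square-root point in J, with made-up numbers: since %: is monotone,
comparing squared distances ranks neighbors the same way as true distances:

   (+/ *: 3 4 - 0 0) > (+/ *: 3 4 - 1 1)       NB. 25 > 13
1
   (%: +/ *: 3 4 - 0 0) > (%: +/ *: 3 4 - 1 1) NB. 5 > 3.60555
1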
>
> On Tue, Jun 10, 2014 at 4:57 PM, Jan-Pieter Jacobs <
> [email protected]> wrote:
>
> > Now that you mention it, I wrote an implementation in J doing just that...
> >
> > So here it is ... I'd stop reading here if you want to give it a go
> > yourself ...
> >
> > Caution: big spoiler ahead in 10
> > 9
> > 8
> > 7
> > 6
> > 5
> > 4
> > 3
> > 2
> > 1
> >
> > In my implementation, I took instances as rows, and features
> > (variables or whatever) as columns.
> > If known labels Y (shape N x Q) correspond to data X (N x P), and
> > XTest (M x P) are to be classified,
> >
> > YTest =: k nnClass Y;X;XTest
> >
> > does the job for binary data. Y can in fact consist of more than one
> > label per training element, and as such YTest has shape M x Q. Due to
> > this fact, it's easy to extend to multiclass problems:
> >
> > YTest =: k nnClass oaa Y;X;XTest
> >
> > My tests point out that the bottleneck is still the distance calculation...
> > All suggestions for improvements are welcome.
> > Sooner or later I'll put the entire thing on the wiki, including tools
> > for doing random cross-validation.
> >
> > Implementation of this beauty below.
> >
> > NB. Distance between rows, different possibilities:
> >
> > NB. true Euclidean distance (mind the &.:)
> > NB. d=: +/&.:*:@(-"1)/
> >
> > NB. D^2, avoiding the square root
> > NB. d=: +/&:*:@(-"1)/
> >
> > NB. D^2 using integer math for speed
> > d=: +/&:*:@(-"1)/&([: <. (2^20) * ])
> >
> > NB. x nn y
> > NB. takes the x nearest neighbors for the test data,
> > NB. returning the indices of the NN in the training data.
> > NB. If discarding the node itself is needed, add >:@ before i.
> > nn =: i.@:[ ({ /:)"1 ({: d&> {.)@]
> >
> > NB. x nnReg y
> > NB. where x = k neighbors and
> > NB. y =: (training labels) ; (training data) ; (data to be classified)
> > NB. k-nn regression is simply taking the average over the labels of
> > NB. the k neighbors.
> > NB. Taking the mean over "_1 makes it extend to multidimensional y
> > nnReg =: (+/%#)"_1@((nn }.) { >@{.@])
> >
> > NB. x nnClass y
> > NB. As before: x = k neighbors and
> > NB. y =: (training labels) ; (training data) ; (data to be classified)
> > NB. Going from regression to classification is a matter of taking
> > NB. the highest one.
> > NB. Take the first maximum encountered per row to eliminate any doubles.
> > NB. Note that, depending on the application, there are probably better
> > NB. ways of resolving multiple maxima.
> > nnClass =: ((i.@# = (i. >./))"1)@:nnReg
> >
> > NB. As nnClass and nnReg can handle multidimensional labels Y, extending to
> > NB. a one-against-all multiclass classifier is trivial:
> > oaa =: 1 : 0
> > :
> > (>{.y) fromOAA x u toOAA y
> > )
> >
> > NB. toOAA y
> > NB. y is the argument as for nnClass / nnReg
> > toOAA =: |:@=&.>@{. , }.
> >
> > NB. x fromOAA y
> > NB. x =: original Y
> > NB. y =: class map as multi-dimensional labels from nnClass
> > NB. converts from multi-dimensional labels to the original ones.
> > fromOAA =: (] #"1 ~.@[)
> >
> > Enjoy!
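A quick sanity check of the above with toy data of my own (not from
Jan-Pieter's post): with k=1, each test point should take the label of
its exact match in X:

   Y =: 0 0 1 1                       NB. labels for four training points
   X =: 4 2 $ 0 0  0 1  10 10  10 11  NB. four 2-D training points
   T =: 2 2 $ 0 0  10 10              NB. two points to classify
   1 nnClass oaa Y;X;T                NB. expect labels 0 and 1, one per row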
> >
> > 2014-06-10 21:52 GMT+02:00 Joe Bogner <[email protected]>:
> > > I was reading this article
> > > http://huonw.github.io/2014/06/10/knn-rust.html
> > > via hackernews https://news.ycombinator.com/item?id=7872398
> > >
> > > It seems like a good exercise for J. I don't have the time to do it
> > > over the next few days but would be very interested in the results.
> > > I might get around to it later in the week if no one else does.
> > >
> > > The data files appear to be here:
> > > https://github.com/c4fsharp/Dojo-Digits-Recognizer/tree/master/Dojo
> > >
> > > I liked how clean the Factor version was:
> > > http://re-factor.blogspot.com/2014/06/comparing-k-nn-in-factor.html
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm