Now that you mention it, I wrote a J implementation doing just that...

So here it is ... stop reading now if you want to give it a go yourself ...

Caution: big spoiler ahead in 10



9




8




7




6




5




4




3




2




1

In my implementation, instances are rows, and features
(variables or whatever) are columns.
If known labels Y (shape N x Q) correspond to data X (N x P), and
XTest (M x P) is to be classified,

YTest =: k nnClass Y;X;XTest

does the job for binary data. Y can in fact consist of more than one
label per training element, in which case YTest has shape M x Q. This
makes it easy to extend to multiclass problems:

YTest =: k nnClass oaa Y;X;XTest
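
For instance, a hypothetical toy run (made-up data; the definitions
appear further below), with Y given as one-hot label columns:

   Y =: 4 2 $ 1 0  1 0  0 1  0 1
   X =: 4 2 $ 0 0  0 1  9 9  9 10
   XTest =: 2 2 $ 0 1  9 9
   3 nnClass Y;X;XTest
1 0
0 1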

My tests indicate the bottleneck remains the distance calculation...
All suggestions for improvement are welcome.
Sooner or later I'll put the entire thing on the wiki, including tools
for doing random cross-validation.

Implementation of this beauty below.

NB. Distance between rows, different possibilities:
NB. mind the &.: (under): the square root is applied back, giving true Euclidean distance
NB. d=: +/&.:*:@(-"1)/

NB. D^2, avoiding the final square root
NB. d=: +/&:*:@(-"1)/

NB. D^2 using integer math for speed (inputs scaled by 2^20 and
NB. floored; only the ordering of distances matters for nn)
d=: +/&:*:@(-"1)/&([: <. (2^20) * ])
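
NB. A hypothetical check (made-up rows) with the Euclidean variant above:
NB.    (2 2 $ 0 0  3 4) (+/&.:*:@(-"1)/) (1 2 $ 0 0)
NB. 0
NB. 5
NB. i.e. a (rows of x) by (rows of y) table of distances.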

NB. x nn y
NB.   takes x nearest neighbors for the test data,
NB.   returning the indices of the NN in the training data.
NB.   If discarding the node itself is needed (e.g. when the test data
NB.   is part of the training data), add >:@ before i.
nn =: i.@:[ ({ /:)"1 ({: d&> {.)@]
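
NB. e.g. a hypothetical check on made-up data, k=2,
NB. with y =: (training data) ; (test data):
NB.    2 nn (3 2 $ 0 0  0 1  5 5) ; (1 2 $ 0 0)
NB. 0 1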

NB. x nnReg y
NB.   where x = k neighbors and
NB.   y =: (training labels) ; (training data) ; (data to be classified)
NB.   k-nn regression simply averages the labels of the k neighbors.
NB.   Taking the mean over "_1 makes it extend to multidimensional y
nnReg =: (+/%#)"_1@((nn }.) { >@{.@])
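
NB. e.g. a hypothetical check (made-up data): the mean of the labels
NB. of the 2 nearest neighbors of a single test point:
NB.    2 nnReg (4 1 $ 10 20 30 40) ; (4 2 $ 0 0  0 1  9 9  9 10) ; (1 2 $ 0 0)
NB. 15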

NB. x nnClass y
NB.   As before: x = k neighbors and
NB.   y =: (training labels) ; (training data) ; (data to be classified)
NB.   Going from regression to classification is a matter of taking
NB.   the highest value: take the first maximum encountered per row
NB.   to eliminate any doubles.
NB.   Note that, depending on the application, there are probably
NB.   better ways of resolving multiple maxima.
nnClass =: ((i.@# = (i. >./))"1)@:nnReg
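
NB. e.g. the argmax step alone, on hypothetical averaged label rows:
NB.    ((i.@# = (i. >./))"1) 2 3 $ 0.2 0.5 0.3  0.6 0.3 0.1
NB. 0 1 0
NB. 1 0 0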


NB. As nnClass and nnReg can handle multidimensional labels Y, extending to
NB. a one-against-all multiclass classifier is trivial:
oaa =: 1 : 0
:
(>{.y) fromOAA x u toOAA y
)

NB. toOAA y
NB.   y is the same argument as for nnClass / nnReg;
NB.   converts the labels to one-against-all (one-hot) columns.
toOAA =: |:@=&.>@{. , }.

NB. x fromOAA y
NB.   x =: original Y
NB.   y =: class map as multidimensional labels from nnClass
NB.   converts from multidimensional labels back to the original ones.
fromOAA =: (] #"1 ~.@[)
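
NB. e.g. a hypothetical 3-class round trip on made-up data:
NB.    Y =: 0 0 1 2 2
NB.    X =: 5 2 $ 0 0  0 1  5 5  10 10  10 11
NB.    3 nnClass oaa Y;X;(2 2 $ 0 0  10 10)
NB. 0
NB. 2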

Enjoy!

2014-06-10 21:52 GMT+02:00 Joe Bogner <[email protected]>:
> I was reading this article http://huonw.github.io/2014/06/10/knn-rust.html
> via hackernews https://news.ycombinator.com/item?id=7872398
>
> It seems like a good exercise for J. I don't have the time to do it over
> the next few days but would be very interested in the results. I might get
> around to it later in the week if no one else does.
>
> The data files appear to be here:
> https://github.com/c4fsharp/Dojo-Digits-Recognizer/tree/master/Dojo
>
> I liked how clean the factor version was:
> http://re-factor.blogspot.com/2014/06/comparing-k-nn-in-factor.html