Yes, wd was used as a way to write information to the JQt window while
training is in progress. Without it, I would have no way of knowing the status
of the model being trained. I only considered JQt usage, not JHS or the
console, so it will not work for those. If you are using JHS or jconsole, I
suggest commenting it out, or simply redefining it:
wd=: ]
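Another option (just a sketch; UseGUI is an illustrative flag name, not
something jlearn defines) is to gate the calls on a flag:

NB. sketch: gate GUI status output on a hypothetical flag
UseGUI=: 0            NB. set to 1 when running under JQt
wd^:UseGUI 'msgs'     NB. applies wd only when UseGUI is 1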
Thanks,
Jon
On Monday, April 29, 2019, 3:10:14 AM GMT+9, Devon McCormick
<[email protected]> wrote:
Hi - trying to run the "fit__pipe" function, I encountered a value error on
this line:
wd^:1 'msgs'
so I commented it out, on the assumption that this is the "wd" defined in JQt
and that it prints some sort of progress message. Is this correct?
Thanks,
Devon
On Sun, Apr 28, 2019 at 9:20 AM jonghough via Programming <
[email protected]> wrote:
> I think you may be right. Thanks for pointing this out. However, since my
> networks mostly work, I am going to assume that having too many biases
> doesn't negatively impact the results, except for adding "useless"
> calculations. If you are correct, I should fix this.
>
> I have edited the source on a new branch to use only a 2-d shaped bias
> (see:
> https://github.com/jonghough/jlearn/blob/feature/conv2d_layer_fix/adv/conv2d.ijs
> )
> This is not yet on the master branch. I am not 100% convinced it is
> correct, so I am going to think about it.
>
> I did, however, test it on the MNIST dataset and got about 90% accuracy on
> the test data after 2 epochs (it takes a couple of hours to run on a PC).
> MNIST data is not particularly challenging, though. I would test it on
> CIFAR-10 if I had some more time, but I don't at the moment.
>
> The MNIST conv net is:
>
> NB. =================================================
>
>
> PATHTOTRAIN=: '/path/on/my/pc/to/mnist/train/input'
> PATHTOTEST=: '/path/on/my/pc/to/mnist/test/input'
> PATHTOTRAINLABELS=:'/path/on/my/pc/to/mnist/train/labels'
> PATHTOTESTLABELS=: '/path/on/my/pc/to/mnist/test/labels'
> rf=: 1!:1   NB. read file
> data=: a.i. toJ dltb , rf < PATHTOTRAIN   NB. convert bytes to integers
> TRAININPUT =: 255 %~ [ 60000 1 28 28 $, 16}. data   NB. drop the 16-byte image-file header; scale pixels to 0..1
>
> data=: a.i. toJ dltb , rf < PATHTOTEST
> TESTINPUT =: 255 %~ [ 10000 1 28 28 $, 16}. data
>
>
> data=: a.i. toJ dltb , rf < PATHTOTRAINLABELS
> TRAINLABELS =: 60000 10 $ , #: 2^ 8}. data   NB. drop the 8-byte label-file header; one-hot via binary digits of 2^label
>
> data=: a.i. toJ dltb , rf < PATHTOTESTLABELS
> TESTLABELS =: 10000 10 $ , #: 2^ 8}. data
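> NB. sanity check of the one-hot trick (an illustrative snippet, not part
> NB. of the original script): each label becomes a reversed one-hot row
> NB.    #: 2 ^ 3 0 9
> NB. 0 0 0 0 0 0 1 0 0 0
> NB. 0 0 0 0 0 0 0 0 0 1
> NB. 1 0 0 0 0 0 0 0 0 0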
>
> pipe=: (100;20;'softmax';1; 'l2';0.0001) conew 'NNPipeline'
> c1=: ((50 1 4 4);3;'relu';'adam';0.01;0) conew 'Conv2D'
> b1 =: (0; 1 ;1e_4;50;0.001) conew 'BatchNorm2D'
> a1 =: 'relu' conew 'Activation'
> c2=: ((64 50 5 5);4;'relu';'adam';0.01;0) conew 'Conv2D'
> b2 =: (0; 1 ;1e_4;64;0.001) conew 'BatchNorm2D'
> a2 =: 'relu' conew 'Activation'
> p1=: 2 conew 'PoolLayer'
> fl=: 1 conew 'FlattenLayer'
> fc1=: (64;34;'tanh';'adam';0.01) conew 'SimpleLayer'
> b3 =: (0; 1 ;1e_4;34;0.001) conew 'BatchNorm'
> a3 =: 'tanh' conew 'Activation'
> fc2=: (34;10;'softmax';'adam';0.01) conew 'SimpleLayer'
> b4 =: (0; 1 ;1e_4;10;0.001) conew 'BatchNorm'
> a4 =: 'softmax' conew 'Activation'
>
> addLayer__pipe c1
> addLayer__pipe b1
> addLayer__pipe a1
> addLayer__pipe c2
> addLayer__pipe b2
> addLayer__pipe a2
> addLayer__pipe p1
> addLayer__pipe fl
> addLayer__pipe fc1
> addLayer__pipe b3
> addLayer__pipe a3
> addLayer__pipe fc2
> addLayer__pipe b4
> addLayer__pipe a4
>
>
>
> TRAINLABELS fit__pipe TRAININPUT
>
> NB. f=: 3 : '+/ ((y + i. 100){TESTLABELS) -:"1 1 (=>./)"1 >{:predict__pipe (y+i.100){TESTINPUT'
> NB. Run f"0[100*i.100 to run prediction on the whole test set (in batches
> NB. of size 100); average the results to get accuracy.
> NB. =================================================
>
> As I said, I am going to go back and look at my notes (I don't have them at
> hand). I am sure you are correct, but I am not yet 100% convinced that my
> new bias shape is correct. After thinking it through I will probably merge
> the fix.
>
> About backprop for the bias: I simply took the ntd (next-layer training
> deltas), averaged them across the first dimension, multiplied by the learn
> rate, and subtracted the result from the current bias. This was, admittedly,
> a fudge on my part. Why average? To make the shapes fit. Biases are shared
> between neurons, so it makes sense to average the deltas that the bias
> contributes to. As I am sure you have noticed, the actual implementation of
> convnet backprop is the trickiest part, and also the least written about. I
> have a copy of Goodfellow and Bengio's Deep Learning book, which is mostly
> excellent, but even that just skims over backprop for convnets, or gives it
> a very abstract mathematical treatment; the actual nitty-gritty details are
> left to the reader. So my own interpretation of the correct implementation
> may be wrong in places (but then again, how wrong can it be, if it gets
> correct answers?).
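> In J terms, the update is roughly this (a sketch; the names are
> illustrative, not the actual jlearn variables):
>
> NB. sketch of the bias update described above
> NB. ntd  - next-layer training deltas, batch along the first axis
> NB. lr   - learning rate
> mean=: +/ % #                    NB. mean over the first axis
> bias=: bias - lr * mean ntd      NB. average the deltas, scale, subtract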
>
> On Sunday, April 28, 2019, 3:34:57 PM GMT+9, Brian Schott
> <[email protected]> wrote:
>
> Jon,
>
> I have been studying your simple_conv_test.ijs example and trying to
> compare it to the *dynamic* example at
> http://cs231n.github.io/convolutional-networks/#conv where only 2 biases
> are used with their stride of 2 and 2 output kernels of shape 3x3. (I
> believe they have 2 biases because of the 2 output kernels.) In contrast,
> according to my manual reconstruction of your verb preRun in conv2d.ijs, I
> get a whopping 90 biases (a 10x3x3 array), one for each of the 10 output
> kernels in each of its 3x3 positions on the 8x8 image.
>
> My confusion is that based on the cs231n example, I would have guessed
> that you would have had only 10 biases, not 90. Can you comment on that,
> please?
>
> [Of course in my example below, my `filter` values are ridiculous.
> And I have not adjusted for epochs and batches.
> But I hope the shape of `filter`, the stride of 2, and the `ks` are
> consistent with your simple example.]
>
>
> filter =: i. 10 3 4 4        NB. 10 output kernels, 3 channels, 4x4 each
> ks =: 2 3$2 2 2 3 4 4        NB. ;._3 spec: movement 2 2 2, window 3 4 4
> $A,B,C                       NB. inputs: 15 images, 3 channels, 8x8
> 15 3 8 8
> cf=: [: |:"2 [: |: [: +/ ks filter&(convFunc"3 3);._3 ]
> $n=: cf"3 A,B,C
> 15 10 3 3
> $2 %~ +: 0.5-~ ? ( }. $ n) $ 0 NB. 90 biases
> 10 3 3
>
> Actually, in my own development of a convnet I have been tempted to do as I
> believe you have done, but have been unsuccessful in the backprop step.
> Conceptually, how do you combine each group of 3x3 dW's to update their
> common single W/kernel (for example, with summation, mean, or max)?
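> With a hypothetical 10 3 3 array of dW's, the three candidates would look
> like this (dW here is just placeholder data):
>
> dW=: ? 10 3 3 $ 0         NB. placeholder deltas, one per kernel position
> (+/@,)"2 dW               NB. summation: one total per kernel, shape 10
> ((+/ % #)@,)"2 dW         NB. mean: one average per kernel
> (>./@,)"2 dW              NB. max: one maximum per kernel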
>
> Thanks,
>
> (B=)
>
--
Devon McCormick, CFA
Quantitative Consultant
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm