Mike Dusenberry created SYSTEMML-1814:
-----------------------------------------

             Summary: Improve slide distribution of the image dataset via improved filtering
                 Key: SYSTEMML-1814
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1814
             Project: SystemML
          Issue Type: Improvement
            Reporter: Mike Dusenberry
            Assignee: Mike Dusenberry


Currently, our models are heavily overfitting on the training dataset.
However, further evaluation has shown that this is not the usual overfitting
caused by an over-expressive model: we are already employing heavy model
freezing (in the most extreme case, unfreezing only the final softmax
classifier of a pretrained ResNet50).  I therefore believe that the overfitting
is likely due to batch effects in the data.  Examination of the original slide
distribution in the sample images dataset supports this, showing a severe
imbalance: the number of samples per slide ranges from 1 to 20723 across the
386 slides listed below.

{code}
     slide_num  count
0          436      1
1          116      1
2          468      2
3           38      3
4          195      4
5          173      5
6           13      7
7          481      8
8           83      9
9          349     11
10         490     15
11         292     17
12         281     22
13         387     26
14         326     32
15         286     32
16          88     39
17         477     48
18         205     57
19         135     58
20         127     58
21          16     61
22         245     66
23           5     81
24         306     83
25         284     91
26         263    100
27          15    120
28         345    124
29         380    128
30          24    137
31         382    150
32           1    154
33         421    164
34         163    169
35         278    171
36         235    197
37         332    197
38         343    207
39          43    237
40         249    246
41         113    256
42         496    262
43         482    264
44          86    269
45         415    269
46         472    326
47         422    329
48         450    340
49         108    348
50           3    390
51         191    402
52         272    474
53          85    483
54          97    484
55         210    508
56         293    544
57          41    595
58         452    613
59         220    613
60         406    651
61          67    665
62         260    666
63         361    673
64         269    684
65          50    684
66         304    753
67         101    769
68         433    868
69           4    898
70         499    915
71         145    917
72         357    918
73         365    940
74          82    951
75         126    965
76         185    965
77         164   1077
78         221   1086
79         165   1111
80         316   1129
81         350   1132
82          89   1162
83          19   1169
84          74   1206
85         132   1248
86          47   1278
87         188   1297
88         459   1312
89         368   1337
90         335   1368
91         225   1373
92         234   1378
93         487   1385
94         247   1464
95         427   1476
96          65   1492
97         402   1500
98         315   1557
99         201   1604
100        344   1607
101        273   1616
102        146   1623
103        341   1636
104        425   1640
105        182   1681
106        403   1682
107        275   1690
108        457   1717
109        448   1724
110        277   1729
111         70   1740
112        141   1747
113        264   1777
114        122   1880
115        319   1915
116        449   1951
117        104   1988
118        377   1993
119        285   2008
120        107   2084
121        410   2141
122         11   2148
123        367   2153
124        416   2162
125        311   2183
126        338   2206
127         51   2233
128        153   2255
129        144   2285
130        497   2358
131        218   2364
132        330   2376
133        308   2392
134        213   2480
135        454   2512
136        103   2567
137        446   2569
138         40   2622
139        251   2629
140        149   2632
141        455   2633
142        430   2669
143        262   2715
144         76   2737
145         18   2748
146        178   2763
147        383   2864
148         54   2871
149        223   2908
150        207   2931
151        486   3043
152        391   3099
153        342   3104
154        390   3116
155        276   3136
156         75   3141
157        181   3171
158        142   3213
159        414   3255
160        137   3276
161        295   3285
162        358   3315
163          7   3322
164        323   3327
165         71   3334
166        243   3344
167        120   3359
168         48   3371
169        434   3387
170        206   3404
171          9   3460
172        476   3467
173         32   3472
174        491   3496
175        444   3502
176        279   3530
177         59   3546
178        174   3556
179        464   3595
180        392   3633
181         99   3677
182         72   3682
183        347   3779
184         28   3804
185        314   3807
186        322   3809
187        492   3823
188        258   3824
189        230   3831
190        354   3887
191        346   3951
192        445   3963
193        209   3969
194          8   3986
195        443   3988
196        290   3993
197        118   4025
198        152   4026
199         56   4078
200        170   4131
201         84   4146
202        413   4150
203        447   4171
204        417   4193
205         60   4210
206         92   4265
207        374   4281
208         94   4307
209        161   4360
210        320   4408
211        114   4451
212        219   4480
213         90   4518
214        233   4528
215        396   4596
216        157   4661
217        117   4696
218        337   4724
219        202   4819
220         34   4827
221        105   4840
222        155   4841
223        176   4895
224        166   4966
225        456   5031
226        254   5085
227        475   5184
228         42   5221
229        172   5330
230        299   5358
231        473   5364
232        131   5369
233         61   5382
234        379   5470
235        355   5488
236        372   5496
237         53   5503
238         17   5523
239        495   5529
240        190   5536
241        451   5583
242        177   5630
243        123   5649
244        231   5686
245        217   5692
246         33   5742
247         55   5767
248        388   5786
249        318   5819
250         81   5838
251         62   5846
252        255   5854
253        485   5890
254        375   5928
255        156   5938
256        224   5945
257        267   5970
258        412   5987
259        136   6038
260        160   6055
261        240   6084
262         39   6093
263        469   6100
264        300   6167
265        183   6178
266        250   6195
267         49   6231
268        471   6251
269        334   6283
270        265   6422
271        407   6468
272        252   6472
273        466   6478
274        227   6528
275        102   6550
276        458   6653
277        140   6667
278        133   6668
279        493   6716
280        465   6729
281        370   6751
282        244   6772
283        216   6772
284        488   6773
285         95   6777
286         52   6788
287         57   6821
288        289   6846
289        362   6939
290        180   6944
291        324   6961
292        211   7012
293         73   7034
294        301   7094
295         23   7106
296         64   7169
297        420   7182
298         36   7219
299        376   7257
300        484   7265
301        253   7275
302        470   7312
303        460   7405
304         98   7425
305        302   7427
306        393   7435
307        159   7554
308        237   7564
309        274   7701
310        359   7769
311         68   7779
312        483   7829
313        151   7910
314        186   7948
315        442   7952
316        259   8049
317        246   8128
318         96   8129
319        271   8176
320        438   8190
321         87   8197
322        162   8226
323        489   8260
324        418   8312
325         31   8504
326        179   8532
327         79   8578
328        226   8600
329         27   8719
330        479   8862
331        268   8883
332        404   8908
333         46   8913
334        437   8961
335        147   9047
336        189   9164
337         20   9242
338        386   9356
339        435   9376
340        432   9495
341        408   9505
342        248   9509
343        462   9619
344        229   9774
345        193   9835
346        167   9871
347         69   9894
348        130   9954
349        327  10072
350        369  10078
351        106  10180
352        194  10212
353        325  10306
354        312  10344
355        303  10502
356        184  10655
357        463  10916
358        426  11055
359        283  11334
360        328  11450
361        129  11467
362        288  11806
363        124  12010
364        171  12250
365        121  12257
366         22  12276
367        423  12310
368        192  12313
369        378  12358
370        307  12366
371        143  12678
372         80  12899
373         66  12920
374        208  12970
375        158  13131
376        148  13423
377        119  13723
378        317  13830
379        395  13834
380        187  14003
381         25  14856
382        399  14905
383        478  16145
384         93  20009
385        215  20723
{code}
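
For reference, a per-slide count table like the one above can be produced with a
few lines of Python.  This is only a minimal sketch using the standard library:
the function names and the toy input are hypothetical, and it assumes one
slide number is available per sample image.

```python
from collections import Counter


def slide_distribution(slide_nums):
    """Return (slide_num, count) pairs, sorted from fewest to most samples.

    `slide_nums` is assumed to hold one slide number per sample image.
    """
    counts = Counter(slide_nums)
    # Sort by count first, then by slide number to make ties deterministic.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


def imbalance_ratio(distribution):
    """Ratio of the largest per-slide count to the smallest."""
    counts = [count for _, count in distribution]
    return max(counts) / min(counts)


# Toy input (hypothetical): three slides with 1, 2, and 3 samples.
toy_slide_nums = [436, 215, 215, 215, 93, 93]
dist = slide_distribution(toy_slide_nums)
ratio = imbalance_ratio(dist)

for slide_num, count in dist:
    print(f"{slide_num:>9} {count:>6}")
print("max/min ratio:", ratio)
```

Applied to the table above, the same ratio would be 20723 / 1, which is the
imbalance this issue is concerned with.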

This task aims to improve the preprocessing algorithm so that it yields a more
even slide distribution in the final image dataset, which should reduce the
batch effects and, hopefully, improve model performance.
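
The improved filtering itself is not specified in this issue.  As one possible
illustration of evening out the distribution, the sketch below caps the number
of samples kept per slide by random subsampling.  This is a swapped-in
technique, not the preprocessing change proposed here; the function name, the
`max_per_slide` parameter, and the `(slide_num, sample_id)` input format are
all assumptions.

```python
import random
from collections import Counter, defaultdict


def cap_samples_per_slide(samples, max_per_slide, seed=0):
    """Keep at most `max_per_slide` samples from each slide.

    `samples` is assumed to be an iterable of (slide_num, sample_id) pairs.
    Slides with more samples than the cap are randomly subsampled; slides
    at or below the cap are kept unchanged.
    """
    rng = random.Random(seed)  # fixed seed for reproducible subsampling
    by_slide = defaultdict(list)
    for slide_num, sample_id in samples:
        by_slide[slide_num].append(sample_id)

    kept = []
    for slide_num in sorted(by_slide):
        ids = by_slide[slide_num]
        if len(ids) > max_per_slide:
            ids = rng.sample(ids, max_per_slide)
        kept.extend((slide_num, sample_id) for sample_id in ids)
    return kept


# Toy input (hypothetical): slide 215 dominates with 10 samples.
toy_samples = (
    [(215, i) for i in range(10)]
    + [(93, i) for i in range(3)]
    + [(436, 0)]
)
capped = cap_samples_per_slide(toy_samples, max_per_slide=3)
capped_counts = Counter(slide_num for slide_num, _ in capped)
print(sorted(capped_counts.items()))
```

A cap only limits the over-represented slides; slides with very few samples
stay under-represented, so the choice of cap trades dataset size against
balance.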



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
