Mike Dusenberry created SYSTEMML-1814:
-----------------------------------------
Summary: Improve slide distribution of the image dataset via
improved filtering
Key: SYSTEMML-1814
URL: https://issues.apache.org/jira/browse/SYSTEMML-1814
Project: SystemML
Issue Type: Improvement
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Currently, our models are heavily overfitting on the training dataset.
However, further evaluation has shown that this is not the usual overfitting
caused by an over-expressive model: we are already freezing most of the model
(in the extreme case, unfreezing only the final softmax classifier of a
pretrained ResNet50). I therefore believe the overfitting is likely due to
batch effects in the data. Examination of the original slide distribution in
the sample images dataset shows a severe imbalance, with per-slide counts
ranging from 1 to 20723:
{code}
slide_num count
0 436 1
1 116 1
2 468 2
3 38 3
4 195 4
5 173 5
6 13 7
7 481 8
8 83 9
9 349 11
10 490 15
11 292 17
12 281 22
13 387 26
14 326 32
15 286 32
16 88 39
17 477 48
18 205 57
19 135 58
20 127 58
21 16 61
22 245 66
23 5 81
24 306 83
25 284 91
26 263 100
27 15 120
28 345 124
29 380 128
30 24 137
31 382 150
32 1 154
33 421 164
34 163 169
35 278 171
36 235 197
37 332 197
38 343 207
39 43 237
40 249 246
41 113 256
42 496 262
43 482 264
44 86 269
45 415 269
46 472 326
47 422 329
48 450 340
49 108 348
50 3 390
51 191 402
52 272 474
53 85 483
54 97 484
55 210 508
56 293 544
57 41 595
58 452 613
59 220 613
60 406 651
61 67 665
62 260 666
63 361 673
64 269 684
65 50 684
66 304 753
67 101 769
68 433 868
69 4 898
70 499 915
71 145 917
72 357 918
73 365 940
74 82 951
75 126 965
76 185 965
77 164 1077
78 221 1086
79 165 1111
80 316 1129
81 350 1132
82 89 1162
83 19 1169
84 74 1206
85 132 1248
86 47 1278
87 188 1297
88 459 1312
89 368 1337
90 335 1368
91 225 1373
92 234 1378
93 487 1385
94 247 1464
95 427 1476
96 65 1492
97 402 1500
98 315 1557
99 201 1604
100 344 1607
101 273 1616
102 146 1623
103 341 1636
104 425 1640
105 182 1681
106 403 1682
107 275 1690
108 457 1717
109 448 1724
110 277 1729
111 70 1740
112 141 1747
113 264 1777
114 122 1880
115 319 1915
116 449 1951
117 104 1988
118 377 1993
119 285 2008
120 107 2084
121 410 2141
122 11 2148
123 367 2153
124 416 2162
125 311 2183
126 338 2206
127 51 2233
128 153 2255
129 144 2285
130 497 2358
131 218 2364
132 330 2376
133 308 2392
134 213 2480
135 454 2512
136 103 2567
137 446 2569
138 40 2622
139 251 2629
140 149 2632
141 455 2633
142 430 2669
143 262 2715
144 76 2737
145 18 2748
146 178 2763
147 383 2864
148 54 2871
149 223 2908
150 207 2931
151 486 3043
152 391 3099
153 342 3104
154 390 3116
155 276 3136
156 75 3141
157 181 3171
158 142 3213
159 414 3255
160 137 3276
161 295 3285
162 358 3315
163 7 3322
164 323 3327
165 71 3334
166 243 3344
167 120 3359
168 48 3371
169 434 3387
170 206 3404
171 9 3460
172 476 3467
173 32 3472
174 491 3496
175 444 3502
176 279 3530
177 59 3546
178 174 3556
179 464 3595
180 392 3633
181 99 3677
182 72 3682
183 347 3779
184 28 3804
185 314 3807
186 322 3809
187 492 3823
188 258 3824
189 230 3831
190 354 3887
191 346 3951
192 445 3963
193 209 3969
194 8 3986
195 443 3988
196 290 3993
197 118 4025
198 152 4026
199 56 4078
200 170 4131
201 84 4146
202 413 4150
203 447 4171
204 417 4193
205 60 4210
206 92 4265
207 374 4281
208 94 4307
209 161 4360
210 320 4408
211 114 4451
212 219 4480
213 90 4518
214 233 4528
215 396 4596
216 157 4661
217 117 4696
218 337 4724
219 202 4819
220 34 4827
221 105 4840
222 155 4841
223 176 4895
224 166 4966
225 456 5031
226 254 5085
227 475 5184
228 42 5221
229 172 5330
230 299 5358
231 473 5364
232 131 5369
233 61 5382
234 379 5470
235 355 5488
236 372 5496
237 53 5503
238 17 5523
239 495 5529
240 190 5536
241 451 5583
242 177 5630
243 123 5649
244 231 5686
245 217 5692
246 33 5742
247 55 5767
248 388 5786
249 318 5819
250 81 5838
251 62 5846
252 255 5854
253 485 5890
254 375 5928
255 156 5938
256 224 5945
257 267 5970
258 412 5987
259 136 6038
260 160 6055
261 240 6084
262 39 6093
263 469 6100
264 300 6167
265 183 6178
266 250 6195
267 49 6231
268 471 6251
269 334 6283
270 265 6422
271 407 6468
272 252 6472
273 466 6478
274 227 6528
275 102 6550
276 458 6653
277 140 6667
278 133 6668
279 493 6716
280 465 6729
281 370 6751
282 244 6772
283 216 6772
284 488 6773
285 95 6777
286 52 6788
287 57 6821
288 289 6846
289 362 6939
290 180 6944
291 324 6961
292 211 7012
293 73 7034
294 301 7094
295 23 7106
296 64 7169
297 420 7182
298 36 7219
299 376 7257
300 484 7265
301 253 7275
302 470 7312
303 460 7405
304 98 7425
305 302 7427
306 393 7435
307 159 7554
308 237 7564
309 274 7701
310 359 7769
311 68 7779
312 483 7829
313 151 7910
314 186 7948
315 442 7952
316 259 8049
317 246 8128
318 96 8129
319 271 8176
320 438 8190
321 87 8197
322 162 8226
323 489 8260
324 418 8312
325 31 8504
326 179 8532
327 79 8578
328 226 8600
329 27 8719
330 479 8862
331 268 8883
332 404 8908
333 46 8913
334 437 8961
335 147 9047
336 189 9164
337 20 9242
338 386 9356
339 435 9376
340 432 9495
341 408 9505
342 248 9509
343 462 9619
344 229 9774
345 193 9835
346 167 9871
347 69 9894
348 130 9954
349 327 10072
350 369 10078
351 106 10180
352 194 10212
353 325 10306
354 312 10344
355 303 10502
356 184 10655
357 463 10916
358 426 11055
359 283 11334
360 328 11450
361 129 11467
362 288 11806
363 124 12010
364 171 12250
365 121 12257
366 22 12276
367 423 12310
368 192 12313
369 378 12358
370 307 12366
371 143 12678
372 80 12899
373 66 12920
374 208 12970
375 158 13131
376 148 13423
377 119 13723
378 317 13830
379 395 13834
380 187 14003
381 25 14856
382 399 14905
383 478 16145
384 93 20009
385 215 20723
{code}
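For reference, a distribution like the one above could be tallied with a few lines of Python. This is only a minimal sketch: it assumes the slide number of every sample is available as a plain iterable, and the names `slide_distribution` and `imbalance_ratio` are hypothetical helpers, not part of the existing preprocessing code. Ties are broken by slide number here for determinism, which may differ from the ordering in the table.

```python
from collections import Counter


def slide_distribution(slide_nums):
    """Return (slide_num, count) pairs sorted by ascending count.

    `slide_nums` is assumed to hold one slide number per sample.
    Ties are broken by slide number so the output is deterministic.
    """
    counts = Counter(slide_nums)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


def imbalance_ratio(distribution):
    """Ratio between the largest and smallest per-slide counts."""
    counts = [count for _, count in distribution]
    return max(counts) / min(counts)
```

Applied to the table above, the ratio would be 20723 / 1, which gives a single number for tracking whether a change to the filtering makes the distribution more even.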
This task aims to improve the preprocessing algorithm so that it yields a more
even slide distribution in the final image dataset, which should reduce the
batch effects and, hopefully, improve model performance metrics.
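As an illustration of what a more even distribution could look like, the sketch below caps the number of samples kept per slide by random downsampling. This is a swapped-in, simpler technique rather than the improved filtering this issue proposes, and it assumes samples are available as `(slide_num, sample)` pairs. The function name `cap_samples_per_slide` and the `max_per_slide` parameter are hypothetical.

```python
import random
from collections import defaultdict


def cap_samples_per_slide(samples, max_per_slide, seed=0):
    """Keep at most `max_per_slide` samples for each slide.

    `samples` is assumed to be an iterable of (slide_num, sample) pairs.
    Slides above the cap are randomly downsampled; slides at or below
    the cap are kept unchanged.
    """
    rng = random.Random(seed)  # fixed seed for reproducible subsets

    # Group samples by the slide they came from.
    by_slide = defaultdict(list)
    for slide_num, sample in samples:
        by_slide[slide_num].append(sample)

    capped = []
    for slide_num in sorted(by_slide):
        group = by_slide[slide_num]
        if len(group) > max_per_slide:
            group = rng.sample(group, max_per_slide)
        capped.extend((slide_num, sample) for sample in group)
    return capped
```

A cap only limits over-represented slides. It does nothing for slides that contribute very few samples, so it would likely need to be combined with changes to the filtering step itself.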