Javiermateor commented on PR #2287: URL: https://github.com/apache/systemds/pull/2287#issuecomment-3117629632
Hi @christinadionysio, thanks for the review. I've completed some tests, and the results confirm our suspicions about GPU compatibility. To ensure the CI passes for this PR, I have reverted the script to use `source("scripts/nn/layers/batch_norm2d_old.dml") as bn2d`, which allows the existing DML-based tests to succeed.

For GPU training of the ResNet example, we had instead imported `nn/layers/batch_norm2d.dml`. That script is a thin wrapper around a built-in C++/Java operator and offloads the entire computation to a highly optimized, low-level function inside the SystemDS engine. In contrast, `scripts/nn/layers/batch_norm2d_old.dml` performs every step of batch normalization (computing means and variances, normalizing, scaling and shifting) in pure DML, using high-level builtins such as `colMeans` and `rowMeans` plus custom helper functions.

The main focus of my investigation was the script's behavior on the GPU backend. Key findings:

The core issue is that `batch_norm2d_old.dml` is incompatible with GPU execution.

* When running the ResNet example with the `-gpu` flag, training consistently fails during the first parameter-update step with a `NullPointerException` in a sparse matrix operation:

  ```
  NullPointerException -- Cannot invoke "org.apache.sysds.runtime.data.SparseBlock.size(int)" because "sparse_inputs[0]" is null.
  ```

* This is the exact same error pattern we documented with the AlexNet implementation, which points to a systemic issue: older, pure-DML scripts have not been adapted to the GPU backend's data structures.
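For reference, the per-channel math that the pure-DML script carries out step by step can be sketched in NumPy. This is only an illustrative sketch: the function name, the flattened `(N, C*Hin*Win)` layout, and the `eps` default are assumptions for this example, not the DML script's actual API.

```python
import numpy as np

def batch_norm2d_forward(X, gamma, beta, C, Hin, Win, eps=1e-5):
    """Sketch of a batch-norm forward pass computed explicitly
    (means, variances, normalize, scale/shift), analogous to what
    the pure-DML script does with colMeans/rowMeans and helpers.
    X: (N, C*Hin*Win) row-major, one flattened image per row."""
    N = X.shape[0]
    # Group values by channel so each channel is aggregated over N*Hin*Win cells.
    Xc = X.reshape(N, C, Hin * Win)
    mean = Xc.mean(axis=(0, 2))                 # per-channel mean
    var = Xc.var(axis=(0, 2))                   # per-channel (biased) variance
    Xn = (Xc - mean[None, :, None]) / np.sqrt(var[None, :, None] + eps)
    out = gamma[None, :, None] * Xn + beta[None, :, None]  # scale and shift
    return out.reshape(N, C * Hin * Win), mean, var
```

The built-in operator computes the same result, but as a single fused kernel instead of this chain of elementwise and aggregation operations.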
The full command we used is:

```
java --add-modules jdk.incubator.vector -Xmx8g -Xms8g -Xmn400m \
  -cp "target/systemds-3.4.0-SNAPSHOT.jar:target/lib/*" \
  org.apache.sysds.api.DMLScript \
  -f scripts/nn/examples/mnist_resnet.dml -exec singlenode -gpu
```

### Recommendation for a Future-Proof Fix

To enable stable and performant GPU training for ResNet, I believe we must migrate it to the modern, built-in operator via `nn/layers/batch_norm2d.dml`. During testing, I found that this migration also requires a minor, related fix in `resnet.dml`: initializing the convolution biases with `matrix(0.0, ...)` instead of `matrix(0, ...)`. This resolves a separate `NullPointerException` on the CPU backend that the new, stricter operator exposes.

Given that the proper fix involves a dependency change and touches GPU compatibility, I've kept this PR minimal so the current CI pipeline passes.