Javiermateor commented on PR #2287:
URL: https://github.com/apache/systemds/pull/2287#issuecomment-3117629632

   Hi @christinadionysio ,
   
   Thanks for the review. I've completed some tests, and the results confirm 
our suspicions about GPU compatibility.
   
   To ensure the CI passes for this PR, I have reverted the script to use 
`source("scripts/nn/layers/batch_norm2d_old.dml") as bn2d`.
   For GPU training of the ResNet example, we had switched the import to 
`nn/layers/batch_norm2d.dml`. That script is a thin wrapper around a built-in 
C++/Java operator and offloads the entire computation to a highly optimized, 
low-level function inside the SystemDS engine. The original 
`scripts/nn/layers/batch_norm2d_old.dml`, by contrast, performs all the 
mathematical steps of batch normalization (computing means and variances, 
normalizing, etc.) in high-level DML, using built-ins such as `colMeans` and 
`rowMeans` plus custom helper functions.
   
   This allows the existing DML-based tests to succeed.
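
   For context, the high-level computation that the old script performs can be 
sketched roughly as follows. This is a simplified, hypothetical sketch of 
per-feature batch normalization on a flattened `(N, D)` input, not the actual 
contents of `batch_norm2d_old.dml` (which additionally handles per-channel 
reshaping and running statistics):

   ```
   # Hypothetical sketch of pure-DML batch normalization.
   # X is (N, D); gamma and beta are (1, D) row vectors.
   eps  = 1e-5
   mean = colMeans(X)                   # per-feature means
   var  = colMeans(X^2) - mean^2        # per-feature (biased) variances
   norm = (X - mean) / sqrt(var + eps)  # normalize each column
   out  = norm * gamma + beta           # scale and shift
   ```

   Every one of these steps materializes intermediates through the generic 
DML operator pipeline, which is exactly where the GPU backend's sparse-block 
handling appears to break, whereas the built-in operator executes the whole 
layer as a single fused kernel.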
   
   The main focus of my investigation was the script's behavior on the GPU 
backend, and here are the key findings:
   
   The core issue is the `batch_norm2d_old.dml` script's incompatibility with 
GPU execution.
   
   *   When running the ResNet example with the `-gpu` flag, training 
consistently fails during the first parameter update step with a 
`NullPointerException` related to sparse matrix operations:
       ```
       NullPointerException -- Cannot invoke 
"org.apache.sysds.runtime.data.SparseBlock.size(int)" because 
"sparse_inputs[0]" is null.
       ```
   
   *   This is the exact same error pattern we documented with the AlexNet 
implementation, which points to a systemic issue with older, pure-DML scripts 
not being adapted for the GPU backend's data structures.
   
   The full command we used is:
   
       ```
       java --add-modules jdk.incubator.vector -Xmx8g -Xms8g -Xmn400m \
         -cp "target/systemds-3.4.0-SNAPSHOT.jar:target/lib/*" \
         org.apache.sysds.api.DMLScript \
         -f scripts/nn/examples/mnist_resnet.dml -exec singlenode -gpu
       ```
   
   ### Recommendation for a Future-Proof Fix
   
   To enable stable and performant GPU training for ResNet, I recommend 
migrating the script to the modern, built-in operator wrapper: 
`nn/layers/batch_norm2d.dml`.
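
   Assuming the two layers expose a compatible API, the migration itself is 
essentially an import swap along these lines (a sketch; the function 
signatures of the two scripts may differ slightly and would need to be 
reconciled where they are called):

   ```
   # before: pure-DML implementation, fails on the GPU backend
   source("scripts/nn/layers/batch_norm2d_old.dml") as bn2d

   # after: wrapper around the built-in C++/Java operator
   source("scripts/nn/layers/batch_norm2d.dml") as bn2d
   ```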
   
   During testing, I found that this migration also requires a minor, related 
fix in `resnet.dml`: initializing the convolution biases with `matrix(0.0, 
...)` instead of `matrix(0, ...)`. This resolves a separate 
`NullPointerException` on the CPU backend that the new, stricter operator 
exposes.
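
   The bias fix is a change of this shape (the variable names here are 
illustrative, not the exact ones used in `resnet.dml`):

   ```
   # before: integer literal; exposed an NPE with the stricter operator
   b = matrix(0, rows=F, cols=1)

   # after: explicit double literal
   b = matrix(0.0, rows=F, cols=1)
   ```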
   
   Given that the proper fix involves a dependency change and touches on GPU 
compatibility, I've kept this PR minimal to ensure the current CI pipeline 
passes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
