felipepessoto opened a new issue, #12387:
URL: https://github.com/apache/gluten/issues/12387

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   **Expected:** Reading from / deleting from a large Delta table that has 
deletion vectors (DVs) completes within a bounded, reasonable memory footprint. 
Vanilla Spark runs Delta's own "huge table" DV tests fine with a 1 GB test heap 
(`-Xmx1024m`).
   
   **Actual:** Under the Gluten Velox bundle, the same reads grow the JVM's 
**native** (off-heap) memory monotonically until the kernel/cgroup OOM-kills 
the process. On Delta's synthetic 2-billion-row DV table the forked test JVM 
climbs to ~13 GB RSS even though its JVM heap is only `-Xmx2G`, i.e. ~11 GB is 
native (Velox), not heap. The growth tracks the duration of a single DV read 
over the huge table, which points at unbounded native materialization on the DV 
/ metadata-row-index read path rather than normal query working set.
   
   Concretely, two Delta tests reproduce it (suite 
`org.apache.spark.sql.delta.deletionvectors.DeletionVectorsSuite`):
   - `huge table: read from tables of 2B rows with existing DV of many zeros`
   - `huge table: delete a small number of rows from tables of 2B rows with DVs`
   
   Both operate on the suite's 2B-row `table5`. The read test alone grew the 
fork from ~5.9 GB to ~13.3 GB over ~13 minutes before the OOM-kill.
   
   Likely area: native row-index materialization on the DV read path. Delta DV 
reads use the metadata row index 
(`spark.databricks.delta.deletionVectors.useMetadataRowIndex`, default true), 
and Gluten offloads that path to Velox (apache/gluten #12269 falls back to 
vanilla Spark only for DML DV scans with `useMetadataRowIndex=false`, so the 
default read path stays native). A maintainer with Velox memory-tracking 
context should confirm the exact allocation site and whether it can be 
bounded/spilled.
   
   ## Gluten version
   main branch
   
   ## Spark version
   spark-4.0.x (actually Spark 4.1.0 -- Delta 4.2.0's default; the form has no 
4.1 option)
   
   ## Spark configurations
   
   From the Delta-on-Gluten test harness (patched `DeltaSQLCommandTest`):
   
       spark.plugins                    = org.apache.gluten.GlutenPlugin
       spark.shuffle.manager            = org.apache.spark.shuffle.sort.ColumnarShuffleManager
       spark.memory.offHeap.enabled     = true
       spark.memory.offHeap.size        = 2g
       spark.gluten.sql.columnar.backend.velox... (default bundle config)
       Delta 4.2.0, Scala 2.13, JDK 17
   
   (The forked test JVM heap is -Xmx2G and off-heap is capped at 2g, yet the 
fork's RSS still reaches ~13 GB, far beyond the ~4 GB those two limits allow -- 
the allocation appears untracked / not honoring the off-heap cap.)
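
To make the gap concrete, here is a minimal Python sketch that adds up the configured limits and compares them with the peak fork RSS from the profiler. It is illustrative only: `parse_size` and `configured_budget` are hypothetical helpers, the profiler's `M` suffix is assumed to mean MiB, and heap plus off-heap is only a lower bound on what a JVM legitimately uses (metaspace, thread stacks, and code cache come on top).

```python
import re

# Settings copied from the issue; JVM_HEAP is the forked test JVM's -Xmx value.
CONF = {
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "2g",
}
JVM_HEAP = "2g"

_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a JVM/Spark-style size such as '2g' or '1024m' to bytes."""
    match = re.fullmatch(r"(\d+)([kmg])b?", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def configured_budget(conf, jvm_heap):
    """Heap plus off-heap cap: the memory the settings nominally allow."""
    off_heap = 0
    if conf.get("spark.memory.offHeap.enabled") == "true":
        off_heap = parse_size(conf["spark.memory.offHeap.size"])
    return parse_size(jvm_heap) + off_heap


budget = configured_budget(CONF, JVM_HEAP)
observed_rss = 13303 * 1024**2  # peak fork RSS from the profiler, assuming MiB

gib = 1024**3
print(f"budget={budget / gib:.1f} GiB "
      f"observed={observed_rss / gib:.1f} GiB "
      f"excess={(observed_rss - budget) / gib:.1f} GiB")
```

Even allowing generously for ordinary JVM overhead, the excess is several times the off-heap cap, which is why the allocation looks untracked rather than merely over-provisioned.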
   
   ## System information
   CI runner: ubuntu-22.04 host, ~16 GB RAM, container 
apache/gluten:centos-9-jdk17. Not run via dev/info.sh (observed in CI).
   
   ## Relevant logs
   
   Evidence from the Delta Spark UT (Gluten) pipeline, run 28071158711, shard 2 
(job 83108337324). Per-minute memory profiler during the "read from tables of 
2B rows with existing DV of many zeros" test (p1289 = forked test JVM with 
-Xmx2G; p382 = sbt launcher):
   
       MEM cgroup=12.53G JVMs=[2664M(p382) 5869M(p1289)]
       MEM cgroup=13.70G JVMs=[2664M(p382) 7777M(p1289)]
       MEM cgroup=13.97G JVMs=[2664M(p382) 8431M(p1289)]
       MEM cgroup=14.32G JVMs=[2623M(p382) 11629M(p1289)]
       MEM cgroup=14.77G JVMs=[1815M(p382) 13122M(p1289)]   <- fork ~13.1G RSS, heap only 2G
       MEM cgroup=14.91G JVMs=[1879M(p382) 13303M(p1289)]
       Warning: Unable to read from client ...                 <- fork OOM-killed here
       MEM cgroup=1.92G  JVMs=[1902M(p382)]                    <- fork gone; cgroup drops ~13G
   
   After the kernel killed the fork, sbt wedged on the dead fork (no hs_err, no 
heap dump -- the signature of a kernel/cgroup OOM-kill rather than a JVM OOM), 
and a hang watchdog had to kill the shard after ~16 minutes of silence.
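
The profiler lines above are regular enough to check mechanically. The following Python sketch extracts the per-sample RSS of one process and confirms that the fork only ever grows. The `rss_series` helper and the regular expression are assumptions written for this log format, not part of the CI tooling.

```python
import re

# Profiler samples copied from the log above (annotations removed).
SAMPLES = """\
MEM cgroup=12.53G JVMs=[2664M(p382) 5869M(p1289)]
MEM cgroup=13.70G JVMs=[2664M(p382) 7777M(p1289)]
MEM cgroup=13.97G JVMs=[2664M(p382) 8431M(p1289)]
MEM cgroup=14.32G JVMs=[2623M(p382) 11629M(p1289)]
MEM cgroup=14.77G JVMs=[1815M(p382) 13122M(p1289)]
MEM cgroup=14.91G JVMs=[1879M(p382) 13303M(p1289)]
MEM cgroup=1.92G  JVMs=[1902M(p382)]
"""

# Matches entries such as "5869M(p1289)": RSS value, then the process id.
ENTRY = re.compile(r"(\d+)M\(p(\d+)\)")


def rss_series(log_text, pid):
    """Return the RSS values (in M, as logged) reported for one pid, in order."""
    series = []
    for line in log_text.splitlines():
        if not line.strip().startswith("MEM "):
            continue
        for value, entry_pid in ENTRY.findall(line):
            if int(entry_pid) == pid:
                series.append(int(value))
    return series


fork = rss_series(SAMPLES, 1289)
deltas = [later - earlier for earlier, later in zip(fork, fork[1:])]

print("fork RSS (M):", fork)
print("growth per sample (M):", deltas)
print("strictly increasing:", all(delta > 0 for delta in deltas))
```

The sbt launcher (p382) stays roughly flat over the same samples, so the growth is confined to the forked test JVM.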
   
   ## Reproduction
   1. Build the Gluten Velox bundle (Spark 4.1 + Scala 2.13 + JDK 17, Delta 
profile).
   2. Run delta-io/delta v4.2.0 `DeletionVectorsSuite` with the Gluten plugin 
enabled (`spark.plugins=org.apache.gluten.GlutenPlugin`), e.g. the two "huge 
table ... 2B rows ... DV" tests above.
      - Equivalent minimal repro: with Gluten Velox enabled, run a count/sum 
scan over a Delta table of billions of rows that carries deletion vectors; 
watch native RSS grow without bound.
   
   ## Impact / workaround
   - Makes large-table DV reads unusable under Gluten Velox (native memory 
blows up and the process is OOM-killed).
   - In the Delta CI pipeline (apache/gluten PR #12278) these two tests are 
force-failed in `setup-delta.sh` to keep the shard from OOM-hanging. That 
workaround should be removed once this is fixed.
   
   
   This was written with the assistance of AI tooling.
   
   ### Relevant logs
   
   ```bash
   https://github.com/apache/gluten/actions/runs/28071158711/job/83108337324
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

