khemchand-zetta commented on PR #8931:
URL: 
https://github.com/apache/incubator-gluten/pull/8931#issuecomment-2786471043

   Hi @JkSelf 
   
[bhj_logs.zip](https://github.com/user-attachments/files/19650047/bhj_logs.zip)
   
   I tried out your PR for the BHJ (broadcast hash join) optimization and ran into some issues during testing. Here's a summary of my observations:
   
   - I compiled the PR and built the Gluten JAR successfully.  
   - My setup included Java 17, Spark 3.4, and TPC-H data at scale factor 1 (about 1 GB).  
   - I ran a simplified version of TPC-H Q5 (we call it `tpch_5_sim`) using the 
following query:
   
   ```sql
   SELECT 
       SUM(l_extendedprice * (1 - l_discount))
   FROM 
       lineitem, 
       orders
   WHERE 
       l_orderkey = o_orderkey
       AND o_orderdate >= DATE '1994-01-01'
       AND o_orderdate < DATE '1995-01-01';
   ```
   
   - I enabled BHJ with a broadcast threshold of 10 GB (just to force a broadcast hash join).
   - I executed the above query 4 times; each run produced a different result, along with core dumps.
   - Sample error observed:
     ```
     ERROR TaskSchedulerImpl: Lost executor 7 on 172.17.0.3: Command exited with code 134
     WARN TransportChannelHandler: Exception in connection from /172.17.0.3:45056
     java.net.SocketException: Connection reset
     ```
   
   - In the core dump I found that the error is related to `org.apache.gluten.vectorized.HashJoinBuilder.clearHashTable(J)`.
   - This looks like some sort of memory corruption in native memory.
   - I also tested the same JAR with BHJ **disabled**, and the query ran 
successfully without any issues.
   - For further reference, I am attaching:
     - Logs and Spark event log files for both BHJ-enabled and BHJ-disabled 
runs with your PR build.
     - Logs for BHJ-enabled and BHJ-disabled runs using a JAR compiled from the 
main release branch of `incubator-gluten`, under the same setup.
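
   For anyone reproducing this, a forced broadcast join can be set up roughly as follows. This is only a sketch: it assumes the stock Spark SQL property `spark.sql.autoBroadcastJoinThreshold` controls the threshold (the exact configuration used for these runs is not shown in this comment), and the `EXPLAIN` step is just a suggested way to confirm the join strategy before running the query.

   ```sql
   -- Assumption: the stock Spark SQL property controls the broadcast threshold.
   -- 10 GiB expressed in bytes (10 * 1024^3).
   SET spark.sql.autoBroadcastJoinThreshold = 10737418240;

   -- Optional check: the plan should show a broadcast hash join rather than
   -- a shuffled hash join or sort-merge join.
   EXPLAIN FORMATTED
   SELECT
       SUM(l_extendedprice * (1 - l_discount))
   FROM
       lineitem,
       orders
   WHERE
       l_orderkey = o_orderkey
       AND o_orderdate >= DATE '1994-01-01'
       AND o_orderdate < DATE '1995-01-01';
   ```

   Setting the threshold back to a small value (or `-1`, which disables automatic broadcast joins in Spark) gives the BHJ-disabled comparison run.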
   
   Application ID mapping, for your reference:
   
   - `app-20250408170927-0013`: BHJ enabled, build from your PR
   - `app-20250408171422-0014`: BHJ disabled (SHJ), build from your PR
   - `app-20250408171737-0015`: BHJ enabled, Gluten main branch release JAR
   - `app-20250408172038-0016`: BHJ disabled (SHJ), Gluten main branch release JAR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

