khemchand-zetta commented on PR #8931: URL: https://github.com/apache/incubator-gluten/pull/8931#issuecomment-2786471043
Hi @JkSelf,

I was trying out your PR for BHJ optimization and ran into some issues during testing. Here's a summary of my observations:

- I compiled the PR and built the Gluten JAR successfully.
- My setup included Java 17, Spark 3.4, and TPC-H data at the 1 GB scale factor.
- I ran a simplified version of TPC-H Q5 (we call it `tpch_5_sim`) using the following query:

```sql
SELECT SUM(l_extendedprice * (1 - l_discount))
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
  AND o_orderdate >= DATE '1994-01-01'
  AND o_orderdate < DATE '1995-01-01';
```

- I enabled BHJ with a broadcast threshold of 10 GB (just to force BHJ to be used).
- I executed the above query 4 times, and each run resulted in a different output along with core dumps.
- Sample error observed:

```
ERROR TaskSchedulerImpl: Lost executor 7 on 172.17.0.3: Command exited with code 134
WARN TransportChannelHandler: Exception in connection from /172.17.0.3:45056
java.net.SocketException: Connection reset
```

- In the core dump, I found that the error is related to `org.apache.gluten.vectorized.HashJoinBuilder.clearHashTable(J)`.
- This looks like some sort of corruption in native memory.
- I also tested the same JAR with BHJ **disabled**, and the query ran successfully without any issues.
- For further reference, I am attaching [bhj_logs.zip](https://github.com/user-attachments/files/19650047/bhj_logs.zip), which contains:
  - Logs and Spark event log files for both BHJ-enabled and BHJ-disabled runs with your PR build.
  - Logs for BHJ-enabled and BHJ-disabled runs using a JAR compiled from the main release branch of `incubator-gluten`, under the same setup.

Application ID mapping for your reference:

- `app-20250408170927-0013`: run with BHJ, build from your PR
- `app-20250408171422-0014`: run without BHJ (SHJ), build from your PR
- `app-20250408171737-0015`: run with BHJ, Gluten main branch release JAR
- `app-20250408172038-0016`: run without BHJ (SHJ), Gluten main branch release JAR
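
For anyone trying to reproduce this, here is a minimal sketch of how the two join modes could be toggled in a Spark SQL session before running the query. It assumes the stock Spark setting `spark.sql.autoBroadcastJoinThreshold` is what controls the choice; the exact mechanism used in the runs above is not stated, and any Gluten-specific settings are not shown.

```sql
-- Assumption: the standard Spark threshold decides whether BHJ is planned.
-- Make tables up to 10 GB eligible for broadcast (value is in bytes: 10 * 1024^3).
SET spark.sql.autoBroadcastJoinThreshold=10737418240;

-- Alternatively, disable broadcast joins so a non-broadcast join is planned.
SET spark.sql.autoBroadcastJoinThreshold=-1;
```

Only one of the two `SET` statements should be active for a given run; comparing the query result across repeated runs under each setting is what exposed the inconsistent outputs described above.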
