Hi,

I have been using a standalone Spark cluster (v1.4.x) with the following configuration: 2 nodes, each running a worker with 1 core and 4g of memory. So my app had 2 executors, with 2 cores and 8g of memory in total.

I have a table in a MySQL database with around 10 million rows and around 10 columns of integer, string and date types (say table1, with columns c1 to c10).

I ran the following queries:

1. select count(*) from table1
   Completes within seconds.
2. select c1, count(*) from table1 group by c1
   Completes within seconds, but takes longer than query 1.
3. select c1, c2, count(*) from table1 group by c1, c2
   Same behavior as query 2.
4. select c1, c2, c3, c4, count(*) from table1 group by c1, c2, c3, c4
   Took a few minutes to finish.
5. select c1, c2, c3, c4, count(*) from table1 group by c1, c2, c3, c4, c5
   The executor goes OOM within a few minutes! (This has one more column in the group by clause.)

It seems that the more group by columns I add, the faster the time grows, apparently exponentially. Is this the expected behavior?

I was monitoring the MySQL process list and observed that the data was transmitted to the executors within a few seconds without any issue.

NOTE: I am not using any partition columns here. So, AFAIU, there is essentially only a single partition for the JDBC RDD.

I ran the same query (query 5) in the MySQL console and got a result within 3 minutes! So I'm wondering what the issue could be here. This OOM exception is actually a blocker. Is there any other tuning I should do? It certainly worries me that MySQL gave a result significantly faster than Spark here.

Looking forward to hearing from you!

Best,
Niranda Perera
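
P.S. To make the single-partition note above concrete: as far as I understand, when the JDBC source is given the `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` options, the read is split into one range query per partition instead of a single query. The plain-Python sketch below only illustrates that kind of range splitting. It is a simplified approximation, not Spark's actual code, and the function name, the numeric column `id` and the bounds are all made up for illustration.

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE predicate per partition; [None] means a single unfiltered read.

    Hypothetical helper: it approximates range-based splitting of a numeric
    column and is not taken from the Spark source.
    """
    if num_partitions <= 1 or upper <= lower:
        # No usable partitioning info: everything is read through one partition.
        return [None]

    # Width of each range; at least 1 so the boundaries keep increasing.
    stride = max(1, (upper - lower) // num_partitions)
    # Internal boundaries between consecutive partitions.
    bounds = [lower + stride * i for i in range(1, num_partitions)]

    predicates = []
    for i in range(num_partitions):
        if i == 0:
            # First range is open at the bottom and also collects NULLs.
            predicates.append("{0} < {1} OR {0} IS NULL".format(column, bounds[0]))
        elif i == num_partitions - 1:
            # Last range is open at the top, so no row is dropped.
            predicates.append("{0} >= {1}".format(column, bounds[-1]))
        else:
            predicates.append(
                "{0} >= {1} AND {0} < {2}".format(column, bounds[i - 1], bounds[i])
            )
    return predicates


# Assumed example: a numeric key `id` spanning roughly 0..10 million, 4 partitions.
for predicate in jdbc_partition_predicates("id", 0, 10000000, 4):
    print("SELECT * FROM table1 WHERE " + predicate)
```

With no partition column, the equivalent is the `[None]` case: one query, one partition, and therefore one executor holding all 10 million rows for the aggregation.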
