Sylvain Zimmer created SPARK-16826:
--------------------------------------

             Summary: java.util.Hashtable limits the throughput of PARSE_URL()
                 Key: SPARK-16826
                 URL: https://issues.apache.org/jira/browse/SPARK-16826
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Sylvain Zimmer

Hello!

I'm using {c4.8xlarge} instances on EC2 with 36 cores and doing lots of {parse_url(url, "host")} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the 36 task threads are in state "BLOCKED", at this point in the stack:

{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

However, when I switch from 1 executor with 36 cores to 9 executors with 4 cores, throughput is almost 10x higher and the CPUs are back at ~100% use.

Thanks!
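
A minimal sketch of one way the lock might be avoided, assuming only the host is needed and the inputs are well-formed URIs. In the trace above, the blocked frames sit in the java.net.URL constructor, which looks up a stream handler in a shared java.util.Hashtable. java.net.URI only parses the string and performs no handler lookup. HostExtractor and hostOf are hypothetical names used for illustration and are not part of Spark. Note that URI is stricter than URL, so some strings that URL accepts come back as null here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; this is not Spark's ParseUrl.
final class HostExtractor {

    // Returns the host component of the given string, or null when the
    // string is not a syntactically valid URI.
    // java.net.URI is a pure parser: unlike new java.net.URL(...), it does
    // not call URL.getURLStreamHandler, the frame that is blocked on
    // Hashtable.get in the thread dump.
    public static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("https://issues.apache.org/jira/browse/SPARK-16826"));
    }
}
```

Whether this matches the semantics of {parse_url(url, "host")} for every input is not verified here; it only shows that the host can be extracted without going through the synchronized lookup.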