[ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Zimmer updated SPARK-16826:
-----------------------------------
    Description: 
Hello!

I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
{{parse_url(url, "host")}} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, 
and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the executor threads are 
"BLOCKED", in that state:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

However, when I switch from 1 executor with 36 cores to 9 executors with 4 
cores, throughput is almost 10x higher and the CPUs are back at ~100% use.

Thanks!

  was:
Hello!

I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
{{parse_url(url, "host")}} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, 
and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the 36 cores are in 
status "BLOCKED", in that state:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

However, when I switch from 1 executor with 36 cores to 9 executors with 4 
cores, throughput is almost 10x higher and the CPUs are back at ~100% use.

Thanks!


> java.util.Hashtable limits the throughput of PARSE_URL()
> --------------------------------------------------------
>
>                 Key: SPARK-16826
>                 URL: https://issues.apache.org/jira/browse/SPARK-16826
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.<init>(URL.java:599)
> java.net.URL.<init>(URL.java:490)
> java.net.URL.<init>(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to