[ 
https://issues.apache.org/jira/browse/SPARK-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7564.
-------------------------------------
    Resolution: Duplicate

> performance bottleneck in SparkSQL using columnar storage
> ---------------------------------------------------------
>
>                 Key: SPARK-7564
>                 URL: https://issues.apache.org/jira/browse/SPARK-7564
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1, 1.3.1
>         Environment: 3 node cluster, each with 100g RAM and 40 cores
>            Reporter: Noam Barkai
>         Attachments: worker profiling showing the bottle-neck.png
>
>
> A query over a table that is fully cached in memory, loaded from columnar 
> storage, shows surprisingly slow performance. The query is a simple SELECT 
> over a 10 GB table that sits comfortably in memory (the Storage tab in the 
> Spark UI confirms this). All operations are in memory; no shuffle takes 
> place (again, verified via the Spark UI).
> Profiling suggests that almost all worker threads are in one of two 
> states:
> 1) either trying to acquire an instance of a Kryo serializer from the pool in 
> SparkSqlSerializer, like so:
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:361)
> com.twitter.chill.ResourcePool.borrow(ResourcePool.java:35)
> org.apache.spark.sql.execution.SparkSqlSerializer$.acquireRelease(SparkSqlSerializer.scala:82)
> ...
> org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$9$$anonfun$14$$anon$2.next(InMemoryColumnarTableScan.scala:279)
> 2) or trying to release one:
> java.util.concurrent.ArrayBlockingQueue.offer(ArrayBlockingQueue.java:298)
> com.twitter.chill.ResourcePool.release(ResourcePool.java:50)
> org.apache.spark.sql.execution.SparkSqlSerializer$.acquireRelease(SparkSqlSerializer.scala:86)
> ...
> org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$9$$anonfun$14$$anon$2.next(InMemoryColumnarTableScan.scala:279)
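The borrow/release pattern in the traces above can be sketched as follows. This is a simplified stand-in for chill's ResourcePool, not its actual code: every borrow() and release() goes through one shared ArrayBlockingQueue, so all worker threads funnel through that queue's single internal lock.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Simplified stand-in for com.twitter.chill.ResourcePool: a fixed number of
// pooled instances behind one shared bounded queue. Every borrow() and
// release() takes the queue's internal lock, which is the contention point
// the profiling shows (poll/offer in the stack traces above).
class SharedPool[T](size: Int)(newInstance: () => T) {
  private val queue = new ArrayBlockingQueue[T](size)
  for (_ <- 1 to size) queue.offer(newInstance())

  def borrow(): T = queue.take()           // blocks while the pool is empty
  def release(obj: T): Unit = queue.offer(obj)
}

object SharedPoolDemo {
  def main(args: Array[String]): Unit = {
    // Pool of one mutable instance: borrow, mutate, release, borrow again.
    val pool = new SharedPool[StringBuilder](1)(() => new StringBuilder)
    val sb = pool.borrow()
    sb.append("row")
    pool.release(sb)
    println(pool.borrow().result()) // prints "row": the same instance is reused
  }
}
```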
> The issue appears when the cached data comes from columnar storage; I was 
> able to reproduce it with both ORC and Parquet.
> When the data is loaded from a parallel TSV text file, the issue does not 
> occur.
> It seems related to the de/serialization calls made via 
> InMemoryColumnarTableScan. 
> The code I'm using (running from the Spark shell):
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext.sql("CACHE TABLE cached_tbl AS SELECT * FROM tbl1 ORDER BY col1").collect()
> hiveContext.sql("SELECT col1, col2, col3 FROM cached_tbl").collect()
> It seems that the use of KryoResourcePool in SparkSqlSerializer causes 
> contention on the underlying ArrayBlockingQueue. A possible fix might be 
> to replace this data structure with something more "multi-thread friendly".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
