Here is a toy |spark-shell| session snippet that demonstrates the difference in memory consumption:
|import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._

// Use a single shuffle partition for this toy example.
setConf("spark.sql.shuffle.partitions", "1")

case class KV(key: Int, value: String)

// Columnar version: caching the SchemaRDD uses the in-memory columnar format.
parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count()

// Raw version: caching the plain RDD of case class objects.
parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count()
|
You can see the result on the storage page of the web UI. On my machine it
shows that the in-memory columnar version uses 11.6 KB while the raw RDD
version uses 76.6 KB.
I'm not quite sure how to do the comparison programmatically, but you can
trace the data source of the "Size in Memory" field shown in the storage
tab of the web UI.
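As a starting point, the following sketch reads roughly the same numbers
through |SparkContext.getRDDStorageInfo|, which returns one |RDDInfo|
(including its |memSize|) per cached RDD. Note that it is marked
@DeveloperApi, so this is only a sketch and may break across releases:
|// Assuming Spark 1.x: getRDDStorageInfo is a @DeveloperApi method on
// SparkContext; each RDDInfo carries the memSize shown in the storage tab.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory")
}
|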
Cheng
On 1/30/15 6:15 PM, ankits wrote:
Hi,
I want to benchmark the memory savings from using the in-memory columnar
storage for SchemaRDDs (via cacheTable) vs. caching the SchemaRDD directly.
It would be really helpful to be able to query this from the spark-shell or
from jobs directly. Could a dev point me to the way to do this? From what I
understand, I will need a reference to the block manager, or something like
RDDInfo.fromRdd(rdd).memSize.
I could use reflection or whatever to work around the private access modifiers.